Cloud MapReduce Zaharia


  • 8/2/2019 Cloud MapReduce Zaharia

    1/52

    UC Berkeley

    Introduction to MapReduce

    and Hadoop

    Matei Zaharia

    UC Berkeley RAD Lab ([email protected])

    What is MapReduce?

    Data-parallel programming model for clusters of commodity machines

    Pioneered by Google, which processes 20 PB of data per day

    Popularized by open-source Hadoop project

    Used by Yahoo!, Facebook, Amazon, and many others

    What is MapReduce used for?

    At Google: Index building for Google Search

    Article clustering for Google News

    Statistical machine translation

    At Yahoo!: Index building for Yahoo! Search

    Spam detection for Yahoo! Mail

    At Facebook: Data mining

    Ad optimization

    Spam detection


    Example: Facebook Lexicon

    www.facebook.com/lexicon


    What is MapReduce used for?

    In research: Analyzing Wikipedia conflicts (PARC)

    Natural language processing (CMU)

    Bioinformatics (Maryland)

    Astronomical image analysis (Washington)

    Ocean climate simulation (Washington)


    Outline

    MapReduce architecture

    Fault tolerance in MapReduce

    Sample applications

    Getting started with Hadoop

    Higher-level languages on top of Hadoop: Pig and Hive


    MapReduce Design Goals

    1. Scalability to large data volumes:
       Scan 100 TB on 1 node @ 50 MB/s = 23 days
       Scan on 1000-node cluster = 33 minutes

    2. Cost-efficiency:
       Commodity nodes (cheap, but unreliable)
       Commodity network
       Automatic fault-tolerance (fewer admins)
       Easy to use (fewer programmers)
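    The scan-time figures above follow from simple arithmetic; a quick check:

    ```python
    TB = 10**12
    MB = 10**6

    data_bytes = 100 * TB
    scan_rate = 50 * MB            # bytes per second per node

    one_node_s = data_bytes / scan_rate   # single-node scan time in seconds
    cluster_s = one_node_s / 1000         # 1000 nodes scanning in parallel

    print(one_node_s / 86400)  # ~23 days
    print(cluster_s / 60)      # ~33 minutes
    ```
    
    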


    Typical Hadoop Cluster

    Aggregation switch

    Rack switch

    40 nodes/rack, 1000-4000 nodes in cluster

    1 Gbps bandwidth within rack, 8 Gbps out of rack

    Node specs (Yahoo terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)


    Typical Hadoop Cluster

    Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf


    Challenges

    Cheap nodes fail, especially if you have many
      Mean time between failures for 1 node = 3 years
      MTBF for 1000 nodes = 1 day
      Solution: Build fault-tolerance into system

    Commodity network = low bandwidth
      Solution: Push computation to the data

    Programming distributed systems is hard
      Solution: Data-parallel programming model: users write map and reduce functions, system handles work distribution and fault tolerance
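    The MTBF claim is simple division; a quick check (taking 3 years ≈ 1095 days):

    ```python
    # With per-node MTBF of 3 years, failures in a 1000-node cluster arrive
    # roughly 1000x as often, i.e. about one per day.
    node_mtbf_days = 3 * 365
    cluster_mtbf_days = node_mtbf_days / 1000
    print(cluster_mtbf_days)  # ~1.1 days
    ```
    
    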


    Hadoop Components

    Distributed file system (HDFS)
      Single namespace for entire cluster
      Replicates data 3x for fault-tolerance

    MapReduce implementation
      Executes user jobs specified as map and reduce functions
      Manages work distribution & fault-tolerance


    Hadoop Distributed File System

    Files split into 128 MB blocks

    Blocks replicated across several datanodes (usually 3)

    Single namenode stores metadata (file names, block locations, etc.)

    Optimized for large files, sequential reads

    Files are append-only

    [Diagram: a namenode holding metadata for File1, whose blocks 1-4 are each stored on three of the datanodes]
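    The 3x replication idea can be sketched in a few lines. This is a toy model: the round-robin placement below is a made-up simplification (real HDFS placement is rack-aware, e.g. one replica on the writer's node and two on a remote rack):

    ```python
    def place_replicas(block_id, datanodes, replication=3):
        # Toy round-robin placement: each block gets `replication`
        # distinct datanodes, starting at an offset derived from its id.
        start = block_id % len(datanodes)
        return [datanodes[(start + i) % len(datanodes)] for i in range(replication)]

    nodes = ["dn1", "dn2", "dn3", "dn4"]
    placement = {block: place_replicas(block, nodes) for block in range(1, 5)}
    print(placement)
    ```

    Losing any single datanode leaves at least two copies of every block, which is what lets the namenode re-replicate in the background.
    
    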


    MapReduce Programming Model

    Data type: key-value records

    Map function:

    (K_in, V_in) → list(K_inter, V_inter)

    Reduce function:

    (K_inter, list(V_inter)) → list(K_out, V_out)


    Example: Word Count

    def mapper(line):

    foreach word in line.split():

    output(word, 1)

    def reducer(key, values):
        output(key, sum(values))
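    The pseudocode above can be run end-to-end as a single-process simulation; a minimal sketch, where the `shuffle` helper stands in for the framework's grouping step:

    ```python
    from collections import defaultdict

    def mapper(line):
        # Emit (word, 1) for every word in the line.
        for word in line.split():
            yield (word, 1)

    def shuffle(pairs):
        # Group intermediate values by key, as the framework
        # does between the map and reduce phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reducer(key, values):
        return (key, sum(values))

    def word_count(lines):
        intermediate = [pair for line in lines for pair in mapper(line)]
        return dict(reducer(k, vs) for k, vs in shuffle(intermediate).items())

    lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
    print(word_count(lines))
    ```

    Running this on the three lines from the next slide reproduces its output, e.g. ("the", 3) and ("brown", 2).
    
    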


    Word Count Execution

    Input (three map tasks):
      "the quick brown fox"
      "the fox ate the mouse"
      "how now brown cow"

    Map output:
      (the, 1), (quick, 1), (brown, 1), (fox, 1)
      (the, 1), (fox, 1), (the, 1), (ate, 1), (mouse, 1)
      (how, 1), (now, 1), (brown, 1), (cow, 1)

    After Shuffle & Sort, two reduce tasks produce:
      brown, 2; fox, 2; how, 1; now, 1; the, 3
      ate, 1; cow, 1; mouse, 1; quick, 1


    MapReduce Execution Details

    Single master controls job execution on multiple slaves, as well as user scheduling

    Mappers preferentially placed on same node or same rack as their input block
      Push computation to data, minimize network use

    Mappers save outputs to local disk rather than pushing directly to reducers
      Allows having more reducers than nodes
      Allows recovery if a reducer crashes


    An Optimization: The Combiner

    A combiner is a local aggregation function for repeated keys produced by the same map

    For associative ops. like sum, count, max

    Decreases size of intermediate data

    Example: local counting for Word Count:

    def combiner(key, values):
        output(key, sum(values))
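    The size reduction is easy to see on one mapper's output; a small sketch:

    ```python
    from collections import defaultdict

    def combine(pairs):
        # Locally sum counts per key before anything crosses the network;
        # valid because addition is associative and commutative.
        local = defaultdict(int)
        for word, count in pairs:
            local[word] += count
        return list(local.items())

    # One mapper's raw output for "the fox ate the mouse":
    raw = [("the", 1), ("fox", 1), ("ate", 1), ("the", 1), ("mouse", 1)]
    combined = combine(raw)
    # Five intermediate records shrink to four; "the" is pre-summed to 2.
    print(combined)
    ```
    
    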


    Word Count with Combiner

    Input (three map tasks):
      "the quick brown fox"
      "the fox ate the mouse"
      "how now brown cow"

    Map & Combine output (note the second mapper now emits (the, 2) instead of two (the, 1) records):
      (the, 1), (quick, 1), (brown, 1), (fox, 1)
      (the, 2), (fox, 1), (ate, 1), (mouse, 1)
      (how, 1), (now, 1), (brown, 1), (cow, 1)

    After Shuffle & Sort, two reduce tasks produce:
      brown, 2; fox, 2; how, 1; now, 1; the, 3
      ate, 1; cow, 1; mouse, 1; quick, 1


    Outline

    MapReduce architecture

    Fault tolerance in MapReduce

    Sample applications

    Getting started with Hadoop

    Higher-level languages on top of Hadoop: Pig and Hive


    Fault Tolerance in MapReduce

    1. If a task crashes: Retry on another node
       Okay for a map because it had no dependencies
       Okay for a reduce because map outputs are on disk
       If the same task repeatedly fails, fail the job or ignore that input block (user-controlled)

    Note: For this and the other fault tolerance features to work, your map and reduce tasks must be side-effect-free


    Fault Tolerance in MapReduce

    2. If a node crashes: Relaunch its current tasks on other nodes
       Relaunch any maps the node previously ran
       Necessary because their output files were lost along with the crashed node


    Fault Tolerance in MapReduce

    3. If a task is going slowly (straggler): Launch a second copy of the task on another node
       Take the output of whichever copy finishes first, and kill the other one

    Critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.
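    Speculative execution can be sketched in a few lines. This is a toy model: the delays and labels are made up, and real Hadoop actively kills the losing attempt rather than letting it run out:

    ```python
    import concurrent.futures
    import time

    def run_task(delay_s, label):
        # Stand-in for a task attempt; `delay_s` models a slow (straggling) node.
        time.sleep(delay_s)
        return label

    # Launch the original attempt and a speculative backup copy of the same
    # task, then take whichever finishes first.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        attempts = [pool.submit(run_task, 0.5, "straggler copy"),
                    pool.submit(run_task, 0.01, "backup copy")]
        done, _ = concurrent.futures.wait(
            attempts, return_when=concurrent.futures.FIRST_COMPLETED)
        winner = next(iter(done)).result()

    print(winner)  # "backup copy" finishes first
    ```
    
    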


    Takeaways

    By providing a data-parallel programming model, MapReduce can control job execution in useful ways:

    Automatic division of job into tasks

    Automatic placement of computation near data

    Automatic load balancing

    Recovery from failures & stragglers

    User focuses on application, not on complexities of distributed computing


    Outline

    MapReduce architecture

    Fault tolerance in MapReduce

    Sample applications

    Getting started with Hadoop

    Higher-level languages on top of Hadoop: Pig and Hive


    1. Search

    Input: (lineNumber, line) records

    Output: lines matching a given pattern

    Map:
        if line matches pattern:
            output(line)

    Reduce: identity function

    Alternative: no reducer (map-only job)
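    The map-only variant is essentially distributed grep; a minimal sketch (records and pattern invented for illustration):

    ```python
    import re

    def search_mapper(record, pattern):
        # record is a (line_number, line) pair; emit lines matching the pattern.
        line_number, line = record
        if re.search(pattern, line):
            yield line

    records = [(1, "error: disk failed"), (2, "all good"), (3, "error: timeout")]
    matches = [line for rec in records for line in search_mapper(rec, r"error")]
    print(matches)  # ['error: disk failed', 'error: timeout']
    ```
    
    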


    2. Sort

    Input: (key, value) records
    Output: same records, sorted by key

    Map: identity function
    Reduce: identity function

    Trick: Pick partitioning function h such that k1 < k2 implies h(k1) < h(k2)

    [Diagram: keys pig, sheep, yak, zebra, aardvark, ant, bee, cow, elephant emerge globally sorted across the reducers: aardvark, ant, bee, cow, elephant / pig, sheep, yak, zebra]
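    The partitioning trick can be shown concretely; a toy order-preserving partitioner over lowercase words (bucketing by first letter is an assumption for illustration, not what Hadoop's TeraSort does):

    ```python
    def partition(key, reducers):
        # Order-preserving h: every key routed to reducer i compares less
        # than every key routed to reducer i+1.
        return min((ord(key[0]) - ord('a')) * reducers // 26, reducers - 1)

    keys = ["pig", "sheep", "yak", "zebra", "aardvark", "ant", "bee", "cow", "elephant"]
    buckets = [[] for _ in range(2)]
    for k in keys:
        buckets[partition(k, 2)].append(k)

    # Each reducer sorts its own bucket; concatenating the buckets in
    # reducer order then yields a globally sorted output.
    result = [k for bucket in buckets for k in sorted(bucket)]
    print(result)
    ```
    
    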


    3. Inverted Index

    Input: (filename, text) records
    Output: list of files containing each word

    Map:
        foreach word in text.split():
            output(word, filename)

    Combine: uniquify filenames for each word

    Reduce:
        def reduce(word, filenames):
            output(word, sort(filenames))
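    The three steps above fold into a few lines when simulated locally; a minimal sketch:

    ```python
    from collections import defaultdict

    def inverted_index(docs):
        # docs: list of (filename, text) records.
        index = defaultdict(set)
        for filename, text in docs:
            for word in text.split():
                index[word].add(filename)   # the set plays the combiner's role
        # reduce step: emit each word with its sorted file list
        return {word: sorted(files) for word, files in index.items()}

    docs = [("hamlet.txt", "to be or not to be"),
            ("12th.txt", "be not afraid of greatness")]
    idx = inverted_index(docs)
    print(idx["be"])  # ['12th.txt', 'hamlet.txt']
    ```
    
    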


    Inverted Index Example

    Input:
      hamlet.txt: "to be or not to be"
      12th.txt:   "be not afraid of greatness"

    Map output:
      (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt)
      (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)

    Reduce output:
      afraid, (12th.txt)
      be, (12th.txt, hamlet.txt)
      greatness, (12th.txt)
      not, (12th.txt, hamlet.txt)
      of, (12th.txt)
      or, (hamlet.txt)
      to, (hamlet.txt)


    4. Most Popular Words

    Input: (filename, text) records
    Output: the 100 words occurring in the most files

    Two-stage solution:
      Job 1: Create inverted index, giving (word, list(file)) records
      Job 2: Map each (word, list(file)) to (count, word); sort these records by count as in the sort job

    Optimizations:
      Map to (word, 1) instead of (word, file) in Job 1
      Estimate count distribution in advance by sampling
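    The two-job pipeline can be simulated locally; a minimal sketch (documents invented for illustration):

    ```python
    from collections import defaultdict

    def top_words(docs, n):
        # Job 1: inverted index -> (word, number of files containing it)
        index = defaultdict(set)
        for filename, text in docs:
            for word in text.split():
                index[word].add(filename)
        # Job 2: map to (count, word) and sort descending, as in the sort job
        counts = sorted(((len(files), word) for word, files in index.items()),
                        reverse=True)
        return [word for count, word in counts[:n]]

    docs = [("hamlet.txt", "to be or not to be"),
            ("12th.txt", "be not afraid of greatness")]
    print(top_words(docs, 2))  # 'be' and 'not' appear in both files
    ```
    
    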


    Outline

    MapReduce architecture

    Fault tolerance in MapReduce

    Sample applications

    Getting started with Hadoop

    Higher-level languages on top of Hadoop: Pig and Hive


    Getting Started with Hadoop

    Download from hadoop.apache.org

    To install locally, unzip and set JAVA_HOME

    Details: hadoop.apache.org/core/docs/current/quickstart.html

    Three ways to write jobs:

    Java API

    Hadoop Streaming (for Python, Perl, etc)

    Pipes API (C++)


    Word Count in Java

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable ONE = new IntWritable(1);

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          output.collect(new Text(itr.nextToken()), ONE);
        }
      }
    }


    Word Count in Java

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }


    Word Count in Java

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setMapperClass(MapClass.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      FileInputFormat.setInputPaths(conf, args[0]);
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      conf.setOutputKeyClass(Text.class);           // out keys are words (strings)
      conf.setOutputValueClass(IntWritable.class);  // values are counts

      JobClient.runJob(conf);
    }


    Word Count in Python with Hadoop Streaming

    Mapper.py:

    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word.lower() + "\t" + "1")

    Reducer.py:

    import sys
    counts = {}
    for line in sys.stdin:
        word, count = line.strip().split("\t")
        counts[word] = counts.get(word, 0) + int(count)
    for word, count in counts.items():
        print(word + "\t" + str(count))


    Outline

    MapReduce architecture

    Fault tolerance in MapReduce

    Sample applications

    Getting started with Hadoop

    Higher-level languages on top of Hadoop: Pig and Hive


    Motivation

    MapReduce is great, as many algorithms can be expressed by a series of MR jobs

    But it's low-level: you must think about keys, values, partitioning, etc.

    Can we capture common job patterns?


    Pig

    Started at Yahoo! Research
    Now runs about 30% of Yahoo!'s jobs

    Features:
      Expresses sequences of MapReduce jobs
      Data model: nested bags of items
      Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
      Easy to plug in Java functions
      Pig Pen dev. env. for Eclipse


    An Example Problem

    Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.

    Load Users Load Pages

    Filter by age

    Join on name

    Group on url

    Count clicks

    Order by clicks

    Take top 5

    Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt


    In MapReduce

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class MRExample {
        public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

            public void map(LongWritable k, Text val,
                    OutputCollector<Text, Text> oc,
                    Reporter reporter) throws IOException {
                // Pull the key out
                String line = val.toString();
                int firstComma = line.indexOf(',');
                String key = line.substring(0, firstComma);
                String value = line.substring(firstComma + 1);
                Text outKey = new Text(key);
                // Prepend an index to the value so we know which file
                // it came from.
                Text outVal = new Text("1" + value);
                oc.collect(outKey, outVal);
            }
        }
        public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

            public void map(LongWritable k, Text val,
                    OutputCollector<Text, Text> oc,
                    Reporter reporter) throws IOException {
                // Pull the key out
                String line = val.toString();
                int firstComma = line.indexOf(',');
                String value = line.substring(firstComma + 1);
                int age = Integer.parseInt(value);
                if (age < 18 || age > 25) return;
                String key = line.substring(0, firstComma);
                Text outKey = new Text(key);
                // Prepend an index to the value so we know which file
                // it came from.
                Text outVal = new Text("2" + value);
                oc.collect(outKey, outVal);
            }
        }
        public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

            public void reduce(Text key,
                    Iterator<Text> iter,
                    OutputCollector<Text, Text> oc,
                    Reporter reporter) throws IOException {
                // For each value, figure out which file it's from and
                // store it accordingly.
                List<String> first = new ArrayList<String>();
                List<String> second = new ArrayList<String>();

                while (iter.hasNext()) {
                    Text t = iter.next();
                    String value = t.toString();
                    if (value.charAt(0) == '1')
                        first.add(value.substring(1));
                    else second.add(value.substring(1));
                    reporter.setStatus("OK");
                }

                // Do the cross product and collect the values
                for (String s1 : first) {
                    for (String s2 : second) {
                        String outval = key + "," + s1 + "," + s2;
                        oc.collect(null, new Text(outval));
                        reporter.setStatus("OK");
                    }
                }
            }
        }
        public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {

            public void map(Text k, Text val,
                    OutputCollector<Text, LongWritable> oc,
                    Reporter reporter) throws IOException {
                // Find the url
                String line = val.toString();
                int firstComma = line.indexOf(',');
                int secondComma = line.indexOf(',', firstComma);
                String key = line.substring(firstComma, secondComma);
                // drop the rest of the record, I don't need it anymore,
                // just pass a 1 for the combiner/reducer to sum instead.
                Text outKey = new Text(key);
                oc.collect(outKey, new LongWritable(1L));
            }
        }
        public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {

            public void reduce(Text key,
                    Iterator<LongWritable> iter,
                    OutputCollector<WritableComparable, Writable> oc,
                    Reporter reporter) throws IOException {
                // Add up all the values we see
                long sum = 0;
                while (iter.hasNext()) {
                    sum += iter.next().get();
                    reporter.setStatus("OK");
                }
                oc.collect(key, new LongWritable(sum));
            }
        }
        public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {

            public void map(WritableComparable key, Writable val,
                    OutputCollector<LongWritable, Text> oc,
                    Reporter reporter) throws IOException {
                oc.collect((LongWritable)val, (Text)key);
            }
        }
        public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {

            int count = 0;
            public void reduce(LongWritable key,
                    Iterator<Text> iter,
                    OutputCollector<LongWritable, Text> oc,
                    Reporter reporter) throws IOException {
                // Only output the first 100 records
                while (count < 100 && iter.hasNext()) {
                    oc.collect(key, iter.next());
                    count++;
                }
            }
        }
        public static void main(String[] args) throws IOException {
            JobConf lp = new JobConf(MRExample.class);
            lp.setJobName("Load Pages");
            lp.setInputFormat(TextInputFormat.class);
            lp.setOutputKeyClass(Text.class);
            lp.setOutputValueClass(Text.class);
            lp.setMapperClass(LoadPages.class);
            FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
            FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
            lp.setNumReduceTasks(0);
            Job loadPages = new Job(lp);

            JobConf lfu = new JobConf(MRExample.class);
            lfu.setJobName("Load and Filter Users");
            lfu.setInputFormat(TextInputFormat.class);
            lfu.setOutputKeyClass(Text.class);
            lfu.setOutputValueClass(Text.class);
            lfu.setMapperClass(LoadAndFilterUsers.class);
            FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
            FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
            lfu.setNumReduceTasks(0);
            Job loadUsers = new Job(lfu);

            JobConf join = new JobConf(MRExample.class);
            join.setJobName("Join Users and Pages");
            join.setInputFormat(KeyValueTextInputFormat.class);
            join.setOutputKeyClass(Text.class);
            join.setOutputValueClass(Text.class);
            join.setMapperClass(IdentityMapper.class);
            join.setReducerClass(Join.class);
            FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
            FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
            FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
            join.setNumReduceTasks(50);
            Job joinJob = new Job(join);
            joinJob.addDependingJob(loadPages);
            joinJob.addDependingJob(loadUsers);

            JobConf group = new JobConf(MRExample.class);
            group.setJobName("Group URLs");
            group.setInputFormat(KeyValueTextInputFormat.class);
            group.setOutputKeyClass(Text.class);
            group.setOutputValueClass(LongWritable.class);
            group.setOutputFormat(SequenceFileOutputFormat.class);
            group.setMapperClass(LoadJoined.class);
            group.setCombinerClass(ReduceUrls.class);
            group.setReducerClass(ReduceUrls.class);
            FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
            FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
            group.setNumReduceTasks(50);
            Job groupJob = new Job(group);
            groupJob.addDependingJob(joinJob);

            JobConf top100 = new JobConf(MRExample.class);
            top100.setJobName("Top 100 sites");
            top100.setInputFormat(SequenceFileInputFormat.class);
            top100.setOutputKeyClass(LongWritable.class);
            top100.setOutputValueClass(Text.class);
            top100.setOutputFormat(SequenceFileOutputFormat.class);
            top100.setMapperClass(LoadClicks.class);
            top100.setCombinerClass(LimitClicks.class);
            top100.setReducerClass(LimitClicks.class);
            FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
            FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
            top100.setNumReduceTasks(1);
            Job limit = new Job(top100);
            limit.addDependingJob(groupJob);

            JobControl jc = new JobControl("Find top 100 sites for 18 to 25");
            jc.addJob(loadPages);
            jc.addJob(loadUsers);
            jc.addJob(joinJob);
            jc.addJob(groupJob);
            jc.addJob(limit);
            jc.run();
        }
    }

    Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt


    In Pig Latin

    Users    = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages    = load 'pages' as (user, url);
    Joined   = join Filtered by name, Pages by user;
    Grouped  = group Joined by url;
    Summed   = foreach Grouped generate group, count(Joined) as clicks;
    Sorted   = order Summed by clicks desc;
    Top5     = limit Sorted 5;

    store Top5 into 'top5sites';


    Ease of Translation

    Notice how naturally the components of the job translate into Pig Latin.

    Load Users Load Pages

    Filter by age

    Join on name

    Group on url

    Count clicks

    Order by clicks

    Take top 5

    Users = load

    Fltrd = filter

    Pages = load

    Joined = join

    Grouped = group

    Summed = count()

    Sorted = order

    Top5 = limit

    Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt


    Ease of Translation

    Notice how naturally the components of the job translate into Pig Latin.

    Load Users Load Pages

    Filter by age

    Join on name

    Group on url

    Count clicks

    Order by clicks

    Take top 5

    Users = load

    Fltrd = filter

    Pages = load

    Joined = join

    Grouped = group

    Summed = count()

    Sorted = order

    Top5 = limit

    Job 1

    Job 2

    Job 3

    Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt


    Hive

    Developed at Facebook
    Used for majority of Facebook jobs

    Relational database built on Hadoop
      Maintains list of table schemas
      SQL-like query language (HQL)
      Can call Hadoop Streaming scripts from HQL
      Supports table partitioning, clustering, complex data types, some optimizations


    Creating a Hive Table

    CREATE TABLE page_views(viewTime INT, userid BIGINT,

    page_url STRING, referrer_url STRING,

    ip STRING COMMENT 'User IP address')

    COMMENT 'This is the page view table'

    PARTITIONED BY(dt STRING, country STRING)

    STORED AS SEQUENCEFILE;

    Partitioning breaks table into separate files for each (dt, country) pair

    Ex: /hive/page_view/dt=2008-06-08,country=US
        /hive/page_view/dt=2008-06-08,country=CA


    Simple Query

    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01'
      AND page_views.date <= '2008-03-31';


    Aggregation and Joins

    Count users who visited each page, broken down by gender:

    SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
    FROM page_views pv JOIN user u ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03'
    GROUP BY pv.page_url, u.gender;

    Sample output:

    page_url    gender   count(userid)
    home.php    MALE     12,141,412
    home.php    FEMALE   15,431,579
    photo.php   MALE     23,941,451
    photo.php   FEMALE   21,231,314


    Using a Hadoop StreamingMapper Script

    SELECT TRANSFORM(page_views.userid, page_views.date)
    USING 'map_script.py'
    AS dt, uid
    CLUSTER BY dt
    FROM page_views;


    Conclusions

    MapReduce's data-parallel programming model hides complexity of distribution and fault tolerance

    Principal philosophies:
      Make it scale, so you can throw hardware at problems
      Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance)

    Hive and Pig further simplify programming

    MapReduce is not suitable for all problems, but when it works, it may save you a lot of time


    Yahoo! Super Computer Cluster - M45

    Yahoo!'s cluster is part of the Open Cirrus Testbed created by HP, Intel, and Yahoo! (see press release at http://research.yahoo.com/node/2328).

    The availability of the Yahoo! cluster was first announced in November 2007 (see press release at http://research.yahoo.com/node/1879).

    The cluster has approximately 4,000 processor-cores and 1.5 petabytes of disks.

    The Yahoo! cluster is intended to run the Apache open source software Hadoop and Pig.

    Each selected university will share the partition with up to three other universities. The initial duration of use is 6 months, potentially renewable for another 6 months upon written agreement.

    For further information, please contact: http://cloud.citris-uc.org/
    Dr. Masoud Nikravesh
    CITRIS and LBNL, Executive Director, [email protected]
    Phone: (510) 643-4522


    Resources

    Hadoop: http://hadoop.apache.org/core/

    Hadoop docs: http://hadoop.apache.org/core/docs/current/

    Pig: http://hadoop.apache.org/pig

    Hive: http://hadoop.apache.org/hive

    Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training