Reduce Side Joins

Page 1: Reduce Side Joins

www.edureka.co/big-data-and-hadoop

Reduce side joins in Map Reduce

View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop

Page 2: Reduce Side Joins

Objectives

At the end of this module, you will be able to:

• Explain what a reduce-side join is
• Explain why reduce-side joins are used
• Identify where MapReduce is used
• Describe the MapReduce flow
• List the steps to implement MapReduce
• Run a reduce-side join using MapReduce

Page 3: Reduce Side Joins

Why do we join data?

Consider an example: the data for a customer is spread across two files (or tables).

Cust_id  Name   Item
001      John   iphone
002      Jenny  laptop

Cust_id  City     Phone
001      NewYork  123456
003      Vegas    365895

To get the complete details, we need to join the two files.

A join combines records that share a key, here Cust_id, into a single useful record. In this example only Cust_id 001 appears in both files, so the join produces:

John iphone NewYork 123456

Page 4: Reduce Side Joins

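Types of join in MapReduce

Data joins in Hadoop are done either on the map side or on the reduce side.

Map side
• Happens on the map side
• Done in memory
• One dataset is big, the other is small
• Expensive in memory, since the small dataset must fit in memory on each mapper

Reduce side
• Happens on the reduce side
• Does not need a dataset held in memory
• Both datasets are huge
• Cheap in memory, but slower, since all records pass through the shuffle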
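To make the map-side idea concrete, here is a minimal sketch in plain Python (not Hadoop API code), assuming the smaller dataset fits in memory. The names `small_table`, `big_records`, and `map_side_join` are hypothetical, and the rows are taken from the customer example earlier in the deck; which file plays the "big" role here is only for illustration.

```python
# Map-side join sketch: the small dataset is held in memory as a lookup
# table, and each record of the big dataset is joined as it is read.
# Plain Python simulation; names are illustrative, not Hadoop API.

# Small dataset (Cust_id -> (City, Phone)), assumed to fit in memory.
small_table = {
    "001": ("NewYork", "123456"),
    "003": ("Vegas", "365895"),
}

# Big dataset, streamed one record at a time: (Cust_id, Name, Item).
big_records = [
    ("001", "John", "iphone"),
    ("002", "Jenny", "laptop"),
]


def map_side_join(record, lookup):
    """Join one record against the in-memory table; None if no match."""
    cust_id, name, item = record
    match = lookup.get(cust_id)
    if match is None:
        return None  # inner join: drop records without a partner
    city, phone = match
    return (name, item, city, phone)


joined = [
    row
    for row in (map_side_join(rec, small_table) for rec in big_records)
    if row is not None
]

for row in joined:
    print(" ".join(row))
```

No shuffle or reduce step is needed in this approach, which is why it only works when one side is small enough to load into memory.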

Page 5: Reduce Side Joins

Where should a reduce-side join be used?

Joining data is arguably one of the biggest uses of Hadoop.

Use it when you need to implement a join in simple steps. Reduce-side joins are straightforward because Hadoop sends identical keys to the same reducer, so the data is already grouped by key for us.

It is handy when all the files to be joined are huge.

Use it when you are not in a hurry for the result, since joining huge data takes time.

Page 6: Reduce Side Joins

Before we go ahead with the reduce-side join, let us refresh “MapReduce”.

Page 7: Reduce Side Joins

Where is MapReduce used?

Weather Forecasting

Problem Statement: » Finding the maximum temperature recorded in a year.

HealthCare

Problem Statement: » De-identify personal health information.
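The weather problem statement (maximum temperature recorded in a year) maps naturally onto a map step and a reduce step. The sketch below is a plain Python simulation, not Hadoop code; the `readings` values are made up purely for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample readings as (year, temperature) pairs; made-up values.
readings = [(1950, 22), (1950, 31), (1951, 28), (1951, 19), (1950, 27)]


def mapper(reading):
    """Map: emit (year, temperature) for one reading."""
    year, temp = reading
    return (year, temp)


def reducer(year, temps):
    """Reduce: keep only the maximum temperature seen for the year."""
    return (year, max(temps))


# Shuffle: group mapper output by key (the year).
grouped = defaultdict(list)
for year, temp in map(mapper, readings):
    grouped[year].append(temp)

max_by_year = dict(reducer(year, temps) for year, temps in sorted(grouped.items()))
print(max_by_year)
```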

Page 8: Reduce Side Joins

Where is MapReduce used?

[Figure: MapReduce overview diagram, with the following branches]

Features: Large Scale Distributed Model, Parallel Programming, A Program Model

Function: Map, Reduce

Used in: Classification, Analytics, Recommendation, Index and Search

Design Pattern: Classification (e.g. Top N records), Analytics (e.g. Join, Selection), Recommendation (e.g. Sort), Summarization (e.g. Inverted Index)

Implemented: Google, Apache Hadoop

For: HDFS, Pig, Hive, HBase

Page 9: Reduce Side Joins

MapReduce Paradigm

The Overall MapReduce Word Count Process

Stages: Input -> Splitting -> Mapping -> Shuffling -> Reducing -> Final Result

Input (K1,V1):
Deer Bear River
Car Car River
Deer Car Bear

Splitting (one line per split):
Split 1: Deer Bear River
Split 2: Car Car River
Split 3: Deer Car Bear

Mapping, List(K2,V2):
Split 1: Deer, 1  Bear, 1  River, 1
Split 2: Car, 1  Car, 1  River, 1
Split 3: Deer, 1  Car, 1  Bear, 1

Shuffling, K2,List(V2):
Bear, (1,1)
Car, (1,1,1)
Deer, (1,1)
River, (1,1)

Reducing and Final Result, List(K3,V3):
Bear, 2
Car, 3
Deer, 2
River, 2
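The word count stages above can be traced with a small simulation. This is a sketch in plain Python, not Hadoop code; the function names `map_words`, `shuffle`, and `reduce_counts` are hypothetical, and the input lines are the three splits from the slide.

```python
from collections import defaultdict

# Input: one line per split, as on the slide.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]


def map_words(line):
    """Mapping: emit (word, 1) for every word in one split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffling: group values by key, giving K2 -> List(V2), sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(word, ones):
    """Reducing: sum the list of ones collected for a word."""
    return (word, sum(ones))


mapped = [pair for line in lines for pair in map_words(line)]
shuffled = shuffle(mapped)
result = [reduce_counts(word, ones) for word, ones in shuffled.items()]
print(result)
```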

Page 15: Reduce Side Joins

MapReduce Job Submission Flow

• Input data is distributed to nodes
• Each map task works on a “split” of data
• Mapper outputs intermediate data
• Data is exchanged between nodes in a “shuffle” process
• Intermediate data of the same key goes to the same reducer
• Reducer output is stored

[Figure: INPUT DATA -> Map (Node 1), Map (Node 2) -> shuffle -> Reduce (Node 1), Reduce (Node 2)]
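The rule that intermediate data of the same key goes to the same reducer can be sketched as a partition function applied during the shuffle. This is a plain Python simulation under stated assumptions: two reducers as in the diagram, CRC32 as a stand-in for whatever hash the real partitioner uses, and hypothetical names throughout (`partition`, `map_outputs`, `reducer_inputs`).

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumption: two reduce tasks, as in the diagram


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key; CRC32 is a stand-in for the real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Intermediate (key, value) pairs produced by map tasks on two nodes.
map_outputs = {
    "node1": [("Deer", 1), ("Bear", 1), ("River", 1)],
    "node2": [("Car", 1), ("Car", 1), ("River", 1), ("Deer", 1)],
}

# Shuffle: every pair is sent to the reducer chosen by its key, so pairs
# with the same key meet at one reducer regardless of which node made them.
reducer_inputs = defaultdict(lambda: defaultdict(list))
for node, pairs in map_outputs.items():
    for key, value in pairs:
        reducer_inputs[partition(key)][key].append(value)

# Reduce: each reducer sums the values for the keys it received.
reducer_outputs = {
    rid: {key: sum(vals) for key, vals in sorted(groups.items())}
    for rid, groups in sorted(reducer_inputs.items())
}

for rid, output in reducer_outputs.items():
    print(f"reducer {rid}: {output}")
```

Because the reducer is chosen from the key alone, the two "River" pairs emitted on different nodes end up in the same group, which is the property a reduce-side join relies on.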

Page 16: Reduce Side Joins

How does a reduce-side join work?

• Apart from keys, we use tagging to identify the source file of each record.
• We use a different mapper to read each file individually.
• Each value emitted by a mapper is tagged with a unique identifier for its file.
• The output of all mappers is routed by key, so every record with a given key goes to exactly one reducer.
• In the reducer, fields from the different data sources are joined on the common key.
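As a minimal sketch of the tagging approach, the plain Python simulation below joins the two customer files from the earlier example. It is not Hadoop code: the tags `"F1"` and `"F2"` and the function names are hypothetical, and the shuffle is simulated with an in-memory dictionary.

```python
from collections import defaultdict

# File 1: Cust_id, Name, Item.  File 2: Cust_id, City, Phone.
file1 = ["001 John iphone", "002 Jenny laptop"]
file2 = ["001 NewYork 123456", "003 Vegas 365895"]


def map_file1(line):
    """Mapper for file 1: emit (Cust_id, (tag, fields)) with tag 'F1'."""
    cust_id, name, item = line.split()
    return (cust_id, ("F1", (name, item)))


def map_file2(line):
    """Mapper for file 2: emit (Cust_id, (tag, fields)) with tag 'F2'."""
    cust_id, city, phone = line.split()
    return (cust_id, ("F2", (city, phone)))


def reduce_join(cust_id, tagged_values):
    """Reducer: split values by tag, then pair each F1 row with each F2 row."""
    left = [fields for tag, fields in tagged_values if tag == "F1"]
    right = [fields for tag, fields in tagged_values if tag == "F2"]
    # Empty result when the key is missing from either file (inner join).
    return [lhs + rhs for lhs in left for rhs in right]


# Shuffle and sort: group the tagged values from both mappers by key.
grouped = defaultdict(list)
mapped = [map_file1(line) for line in file1] + [map_file2(line) for line in file2]
for key, value in mapped:
    grouped[key].append(value)

joined = []
for cust_id in sorted(grouped):
    joined.extend(reduce_join(cust_id, grouped[cust_id]))

for row in joined:
    print(" ".join(row))
```

The tag is what lets the reducer tell which file a value came from once values from both files have been mixed together under one key.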

Page 17: Reduce Side Joins

How does a reduce-side join work?

[Figure: File 1 -> Map Task 1 and File 2 -> Map Task 2, each emitting {tag} value; Partitioner, then shuffling and sorting; Reducer 1 -> Part-001, Reducer 2 -> Part-002]

Page 18: Reduce Side Joins

Reduce Side Join

Demo

Page 19: Reduce Side Joins