www.edureka.co/big-data-and-hadoop
Reduce-Side Joins in MapReduce
View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop
Slide 2
Objectives

At the end of this module, you will be able to:
• Explain what a Reduce-side join is
• Explain why Reduce-side joins are used
• Identify where MapReduce is used
• Describe the MapReduce flow
• List the steps to implement MapReduce
• Run a Reduce-side join using MapReduce
Slide 3
Why do we join data?

Consider an example: we have the data of a customer in two files/tables.

Cust_id  Name   Item
001      John   iphone
002      Jenny  laptop

Cust_id  City     Phone
001      NewYork  123456
003      Vegas    365895

To get the complete details, one needs to join both data files. Using joins, we can generate data which is useful and sensible, based on some key; here it is Cust_id.

001  John  iphone  NewYork  123456
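The join above can be sketched in plain Python (not Hadoop); the two dictionaries below simply restate the tables, keyed by Cust_id:

```python
# Minimal sketch of joining the two customer files on Cust_id.
# The records mirror the tables above; this is plain Python, not Hadoop.
orders = {"001": ("John", "iphone"), "002": ("Jenny", "laptop")}
contacts = {"001": ("NewYork", "123456"), "003": ("Vegas", "365895")}

# Inner join: only Cust_ids present in both sources produce output.
joined = {cid: orders[cid] + contacts[cid] for cid in orders if cid in contacts}
print(joined)  # {'001': ('John', 'iphone', 'NewYork', '123456')}
```

Cust_id 002 and 003 drop out because they appear in only one source, which is exactly what an inner join does.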
Slide 4
Types of Join in MapReduce

Data joins in Hadoop

Map side:
• Happens on the map side, in memory, during the map phase
• Suited when one dataset is big and the other is small enough to fit in memory
• Avoids the shuffle, so it is comparatively cheap

Reduce side:
• Happens on the reduce side, after the shuffle
• Done off memory (data is streamed through the sort/shuffle rather than held in memory)
• Works when both datasets are huge
• Comparatively expensive, since all the data travels through the shuffle
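As a minimal illustration of the map-side case, assuming the small dataset fits in memory, each mapper can hold it as a dictionary and join during the map call with no shuffle at all (the records reuse the sample customers from the earlier slide):

```python
# Map-side join sketch: the small dataset (cities) is loaded into memory
# and each "map" call joins against it directly -- no shuffle needed.
small = {"001": "NewYork", "003": "Vegas"}  # small enough to fit in memory

def map_join(record, lookup):
    cust_id, name, item = record
    city = lookup.get(cust_id)          # in-memory lookup per record
    return (cust_id, name, item, city) if city else None

big = [("001", "John", "iphone"), ("002", "Jenny", "laptop")]
result = [r for rec in big if (r := map_join(rec, small))]
print(result)  # [('001', 'John', 'iphone', 'NewYork')]
```

In Hadoop this pattern corresponds to distributing the small file to every mapper; the rest of this module focuses on the reduce-side alternative, which has no such size restriction.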
Slide 5
Where should Reduce-Side Join be used?

• Joining data is arguably one of the biggest uses of Hadoop.
• Reduce-side joins are straightforward to implement: Hadoop sends identical keys to the same reducer, so by default the data is organized for us.
• Handy when all the files on which the join is to be performed are huge in size.
• Should be used when you are not in a hurry for the result, since joining huge datasets takes time.
Slide 6
Before we go ahead with the Reduce-side join, let us refresh “MapReduce”.
Slide 7
Where is MapReduce used?

Weather Forecasting
Problem Statement:
» Finding the maximum temperature recorded in a year.

HealthCare
Problem Statement:
» De-identifying personal health information.
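The maximum-temperature problem can be sketched as map, shuffle, and reduce steps in plain Python; the (year, temperature) records below are made-up sample data:

```python
# Sketch of the "maximum temperature in a year" problem as map and
# reduce steps; the (year, temp) records are made-up sample data.
from collections import defaultdict

records = [("2014", 32), ("2014", 41), ("2015", 38), ("2015", 29)]

# Map: emit (year, temperature) pairs; Shuffle: group values by year.
grouped = defaultdict(list)
for year, temp in records:
    grouped[year].append(temp)

# Reduce: take the maximum temperature per year.
max_temp = {year: max(temps) for year, temps in grouped.items()}
print(max_temp)  # {'2014': 41, '2015': 38}
```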
Slide 8
Where is MapReduce used?

MapReduce is a programming model: a large-scale distributed model for parallel programming, often described as a design pattern. Its two functions are Map and Reduce.

Used in:
• Classification (e.g., Top-N records)
• Analytics (e.g., Join, Selection)
• Recommendation (e.g., Sort)
• Summarization (e.g., Inverted Index)
• Index and Search

Implemented for: Apache Hadoop, HDFS, Pig, Hive, HBase
Slide 9
MapReduce Paradigm
The Overall MapReduce Word Count Process

Input → Splitting → Mapping → Shuffling → Reducing → Final Result
(K1,V1) → List(K2,V2) → K2,List(V2) → List(K3,V3)

Input:
Deer Bear River
Car Car River
Deer Car Bear

Splitting (K1,V1): each line becomes one split.

Mapping (List(K2,V2)):
Deer,1  Bear,1  River,1
Car,1  Car,1  River,1
Deer,1  Car,1  Bear,1

Shuffling (K2,List(V2)):
Bear,(1,1)
Car,(1,1,1)
Deer,(1,1)
River,(1,1)

Reducing (List(K3,V3)):
Bear,2
Car,3
Deer,2
River,2

Final Result:
Bear,2  Car,3  Deer,2  River,2
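The word-count flow above can be simulated in a few lines of plain Python, one step per phase:

```python
# Simulation of the word-count flow above: splitting, mapping,
# shuffling, and reducing, using the same three input lines.
from collections import defaultdict

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Mapping: each word becomes a (word, 1) pair -- List(K2,V2).
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling: group the 1s by word -- K2,List(V2).
shuffled = defaultdict(list)
for word, one in mapped:
    shuffled[word].append(one)

# Reducing: sum each word's list -- List(K3,V3).
counts = {word: sum(ones) for word, ones in shuffled.items()}
print(sorted(counts.items()))
# [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```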
Slide 10

MapReduce Job Submission Flow

• Input data is distributed to nodes
• Each map task works on a “split” of data
• Mapper outputs intermediate data
• Data is exchanged between nodes in a “shuffle” process
• Intermediate data with the same key goes to the same reducer
• Reducer output is stored

[Diagram: INPUT DATA is split across Node 1 and Node 2; each node runs a Map task; shuffled intermediate data feeds Reduce tasks on Node 1 and Node 2.]
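The guarantee that intermediate data with the same key reaches the same reducer comes from partitioning. A sketch of the idea, with Python's hash() standing in for Java's hashCode(), so the actual partition numbers differ from Hadoop's:

```python
# Sketch of why identical keys reach the same reducer: the default
# partitioner assigns each key to hash(key) mod numReduceTasks, so
# every occurrence of a key maps to one fixed partition.
def partition(key: str, num_reducers: int) -> int:
    # Stand-in for Hadoop's HashPartitioner; Python's hash() replaces
    # Java's hashCode() here, so the exact numbers differ from Hadoop.
    return hash(key) % num_reducers

keys = ["Bear", "Car", "Bear", "River", "Car"]
assignments = [(k, partition(k, 2)) for k in keys]

# Every repeat of a key lands in the same partition.
assert partition("Bear", 2) == partition("Bear", 2)
print(assignments)
```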
Slide 16
How it works on the Reduce side

• Apart from keys, we use tagging to identify the source file in reduce-side joins.
• We use different mappers to read the files individually.
• Each value emitted from the mappers is tagged with a unique identifier for its source file.
• The output of all the mappers goes to the reducers, with each unique key routed to one reducer.
• In the reducer, fields from the different data sources are joined based on the common key.
Slide 17
How it works on the Reduce side

[Diagram: File 1 feeds Map Task 1 and File 2 feeds Map Task 2; each emits ({tag}, value) pairs. A Partitioner routes the pairs through shuffling and sorting to Reducer 1 and Reducer 2, which write outputs Part-001 and Part-002.]
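Putting the pieces together, a reduce-side join with tagging can be simulated end to end in plain Python; the tag names CUST and CITY are illustrative choices, not part of any Hadoop API:

```python
# End-to-end sketch of a reduce-side join with tagging, in plain Python.
# Two "mappers" tag each value with its source file; the shuffle groups
# values by Cust_id; the "reducer" combines fields from both tags.
from collections import defaultdict

file1 = [("001", "John", "iphone"), ("002", "Jenny", "laptop")]
file2 = [("001", "NewYork", "123456"), ("003", "Vegas", "365895")]

# Map tasks: emit (key, (tag, value)) so the reducer can tell sources apart.
mapped = [(cid, ("CUST", rest)) for cid, *rest in file1]
mapped += [(cid, ("CITY", rest)) for cid, *rest in file2]

# Shuffle and sort: all values for a Cust_id reach the same reducer.
grouped = defaultdict(list)
for key, tagged in mapped:
    grouped[key].append(tagged)

# Reduce: join only when both sources contributed a record for the key.
joined = {}
for key, values in grouped.items():
    by_tag = dict(values)  # assumes one record per tag per key
    if "CUST" in by_tag and "CITY" in by_tag:
        joined[key] = by_tag["CUST"] + by_tag["CITY"]

print(joined)  # {'001': ['John', 'iphone', 'NewYork', '123456']}
```

The tag is what lets the reducer distinguish fields from File 1 and File 2 once the shuffle has merged everything under a shared key.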
Slide 18
Reduce Side Join
Demo