www.edureka.co/big-data-and-hadoop
Reduce-Side Joins in MapReduce
View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop
Slide 2
Objectives

At the end of this module, you will be able to:
• Explain what a Reduce-side join is
• Explain why Reduce-side joins are used
• Identify where MapReduce is used
• Describe the MapReduce flow
• List the steps to implement MapReduce
• Run a Reduce-side join using MapReduce
Slide 3
Why do we join data?

Consider an example: we have the data of a customer in two files/tables.

Cust_id  Name   Item
001      John   iphone
002      Jenny  laptop

Cust_id  City     Phone
001      NewYork  123456
003      Vegas    365895

To get the complete details, one needs to join both data files. Using joins, we can generate data which is useful and sensible, based on some key; here it is Cust_id.

001  John  iphone  NewYork  123456
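The join above can be sketched in plain Python (not Hadoop); the two dictionaries below simply restate the tables, keyed by Cust_id:

```python
# Minimal sketch of joining the two customer files on Cust_id.
# The records mirror the tables above; this is plain Python, not Hadoop.
orders = {"001": ("John", "iphone"), "002": ("Jenny", "laptop")}
contacts = {"001": ("NewYork", "123456"), "003": ("Vegas", "365895")}

# Inner join: only Cust_ids present in both sources produce output.
joined = {cid: orders[cid] + contacts[cid] for cid in orders if cid in contacts}
print(joined)  # {'001': ('John', 'iphone', 'NewYork', '123456')}
```

Cust_id 002 and 003 drop out because they appear in only one source, which is exactly what an inner join does.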
Slide 4
Types of Join in MapReduce

Data joins in Hadoop

Map side:
• Happens on the map side, in memory, during the map phase
• Suited when one dataset is big and the other is small enough to fit in memory
• Avoids the shuffle, so it is comparatively cheap

Reduce side:
• Happens on the reduce side, after the shuffle
• Done off memory (data is streamed through the sort/shuffle rather than held in memory)
• Works when both datasets are huge
• Comparatively expensive, since all the data travels through the shuffle
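As a minimal illustration of the map-side case, assuming the small dataset fits in memory, each mapper can hold it as a dictionary and join during the map call with no shuffle at all (the records reuse the sample customers from the earlier slide):

```python
# Map-side join sketch: the small dataset (cities) is loaded into memory
# and each "map" call joins against it directly -- no shuffle needed.
small = {"001": "NewYork", "003": "Vegas"}  # small enough to fit in memory

def map_join(record, lookup):
    cust_id, name, item = record
    city = lookup.get(cust_id)          # in-memory lookup per record
    return (cust_id, name, item, city) if city else None

big = [("001", "John", "iphone"), ("002", "Jenny", "laptop")]
result = [r for rec in big if (r := map_join(rec, small))]
print(result)  # [('001', 'John', 'iphone', 'NewYork')]
```

In Hadoop this pattern corresponds to distributing the small file to every mapper; the rest of this module focuses on the reduce-side alternative, which has no such size restriction.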
Slide 5
Where should Reduce-Side Join be used?

• Joining data is arguably one of the biggest uses of Hadoop.
• Reduce-side joins are straightforward to implement: Hadoop sends identical keys to the same reducer, so by default the data is organized for us.
• Handy when all the files on which the join is to be performed are huge in size.
• Should be used when you are not in a hurry for the result, since joining huge datasets takes time.
Slide 6
Before we go ahead with the Reduce-side join, let us refresh “MapReduce”.
Slide 7
Where is MapReduce used?

Weather Forecasting
Problem Statement:
» Finding the maximum temperature recorded in a year.

HealthCare
Problem Statement:
» De-identifying personal health information.
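The maximum-temperature problem can be sketched as map, shuffle, and reduce steps in plain Python; the (year, temperature) records below are made-up sample data:

```python
# Sketch of the "maximum temperature in a year" problem as map and
# reduce steps; the (year, temp) records are made-up sample data.
from collections import defaultdict

records = [("2014", 32), ("2014", 41), ("2015", 38), ("2015", 29)]

# Map: emit (year, temperature) pairs; Shuffle: group values by year.
grouped = defaultdict(list)
for year, temp in records:
    grouped[year].append(temp)

# Reduce: take the maximum temperature per year.
max_temp = {year: max(temps) for year, temps in grouped.items()}
print(max_temp)  # {'2014': 41, '2015': 38}
```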
Slide 8
Where is MapReduce used?

MapReduce is a programming model: a large-scale distributed model for parallel programming, often described as a design pattern. Its two functions are Map and Reduce.

Used in:
• Classification (e.g., Top-N records)
• Analytics (e.g., Join, Selection)
• Recommendation (e.g., Sort)
• Summarization (e.g., Inverted Index)
• Index and Search

Implemented for: Apache Hadoop, HDFS, Pig, Hive, HBase
Slide 9
MapReduce Paradigm
The Overall MapReduce Word Count Process

Input → Splitting → Mapping → Shuffling → Reducing → Final Result
(K1,V1) → List(K2,V2) → K2,List(V2) → List(K3,V3)

Input:
Deer Bear River
Car Car River
Deer Car Bear

Splitting (K1,V1): each line becomes one split.

Mapping (List(K2,V2)):
Deer,1  Bear,1  River,1
Car,1  Car,1  River,1
Deer,1  Car,1  Bear,1

Shuffling (K2,List(V2)):
Bear,(1,1)
Car,(1,1,1)
Deer,(1,1)
River,(1,1)

Reducing (List(K3,V3)):
Bear,2
Car,3
Deer,2
River,2

Final Result:
Bear,2  Car,3  Deer,2  River,2
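The word-count flow above can be simulated in a few lines of plain Python, one step per phase:

```python
# Simulation of the word-count flow above: splitting, mapping,
# shuffling, and reducing, using the same three input lines.
from collections import defaultdict

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Mapping: each word becomes a (word, 1) pair -- List(K2,V2).
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling: group the 1s by word -- K2,List(V2).
shuffled = defaultdict(list)
for word, one in mapped:
    shuffled[word].append(one)

# Reducing: sum each word's list -- List(K3,V3).
counts = {word: sum(ones) for word, ones in shuffled.items()}
print(sorted(counts.items()))
# [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```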
Slide 10

MapReduce Job Submission Flow

• Input data is distributed to nodes
• Each map task works on a “split” of data
• Mapper outputs intermediate data
• Data is exchanged between nodes in a “shuffle” process
• Intermediate data with the same key goes to the same reducer
• Reducer output is stored

[Diagram: INPUT DATA is split across Node 1 and Node 2; each node runs a Map task; shuffled intermediate data feeds Reduce tasks on Node 1 and Node 2.]
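The guarantee that intermediate data with the same key reaches the same reducer comes from partitioning. A sketch of the idea, with Python's hash() standing in for Java's hashCode(), so the actual partition numbers differ from Hadoop's:

```python
# Sketch of why identical keys reach the same reducer: the default
# partitioner assigns each key to hash(key) mod numReduceTasks, so
# every occurrence of a key maps to one fixed partition.
def partition(key: str, num_reducers: int) -> int:
    # Stand-in for Hadoop's HashPartitioner; Python's hash() replaces
    # Java's hashCode() here, so the exact numbers differ from Hadoop.
    return hash(key) % num_reducers

keys = ["Bear", "Car", "Bear", "River", "Car"]
assignments = [(k, partition(k, 2)) for k in keys]

# Every repeat of a key lands in the same partition.
assert partition("Bear", 2) == partition("Bear", 2)
print(assignments)
```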
Slide 16
How it works on the Reduce side

• Apart from keys, we use tagging to identify the source file in reduce-side joins.
• We use different mappers to read the files individually.
• Each value emitted from the mappers is tagged with a unique identifier for its source file.
• The output of all the mappers goes to the reducers, with each unique key routed to one reducer.
• In the reducer, fields from the different data sources are joined based on the common key.
Slide 17
How it works on the Reduce side

[Diagram: File 1 feeds Map Task 1 and File 2 feeds Map Task 2; each emits ({tag}, value) pairs. A Partitioner routes the pairs through shuffling and sorting to Reducer 1 and Reducer 2, which write outputs Part-001 and Part-002.]
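Putting the pieces together, a reduce-side join with tagging can be simulated end to end in plain Python; the tag names CUST and CITY are illustrative choices, not part of any Hadoop API:

```python
# End-to-end sketch of a reduce-side join with tagging, in plain Python.
# Two "mappers" tag each value with its source file; the shuffle groups
# values by Cust_id; the "reducer" combines fields from both tags.
from collections import defaultdict

file1 = [("001", "John", "iphone"), ("002", "Jenny", "laptop")]
file2 = [("001", "NewYork", "123456"), ("003", "Vegas", "365895")]

# Map tasks: emit (key, (tag, value)) so the reducer can tell sources apart.
mapped = [(cid, ("CUST", rest)) for cid, *rest in file1]
mapped += [(cid, ("CITY", rest)) for cid, *rest in file2]

# Shuffle and sort: all values for a Cust_id reach the same reducer.
grouped = defaultdict(list)
for key, tagged in mapped:
    grouped[key].append(tagged)

# Reduce: join only when both sources contributed a record for the key.
joined = {}
for key, values in grouped.items():
    by_tag = dict(values)  # assumes one record per tag per key
    if "CUST" in by_tag and "CITY" in by_tag:
        joined[key] = by_tag["CUST"] + by_tag["CITY"]

print(joined)  # {'001': ['John', 'iphone', 'NewYork', '123456']}
```

The tag is what lets the reducer distinguish fields from File 1 and File 2 once the shuffle has merged everything under a shared key.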
Slide 18
Reduce Side Join
Demo