Pravin Singh
introducing
BIG DATA
WHAT THE HECK IS BIG DATA?
Any collection of data sets so large and complex that it becomes difficult to process using current data management tools or traditional data processing applications.
Volume
• Exceeds physical limits of vertical scalability
Velocity
• Decision window small due to data change rate
Variety
• Many different formats make integration expensive
WHY SO LOW-COST?
Source: EMC
WHY SO FAST?
• Massive Parallel Processing
• Data Locality
• Optimized for write once – read many
• Sequential reads, not random access
Hello Hadoop!
You have an interesting name.
1
Hadoop Architecture
Source: Hortonworks
The Hadoop Zoo
HDFS
MapReduce
Pig
Hive
HCat
Giraph
Mahout
Zookeeper
The Real Simple Hadoop Architecture
MapReduce Engine
JobTracker
TaskTracker 1
TaskTracker 2 … TaskTracker N
HDFS Cluster
NameNode
DataNode 1
DataNode 2 … DataNode N
Hello HDFS!
Have we met before?
2
HDFS
My Data.txt (150 MB)
Split into blocks: 64 MB, 64 MB, 22 MB
Name Node
Block copies:
64 MB 64 MB
64 MB 64 MB
22 MB 22 MB
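The 150 MB example can be reproduced with a few lines of Python. This is only a minimal sketch: `split_into_blocks` and `place_replicas` are hypothetical helper names, the replication factor of 3 is the usual Hadoop default rather than something stated on the slide, and the round-robin placement is a simple stand-in for HDFS's real rack-aware placement policy.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file is cut into."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left; it is not padded.
        blocks.append(remainder)
    return blocks


def place_replicas(blocks, node_count, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: this is an illustrative placement only, not the
    policy a real NameNode applies.
    """
    if replication > node_count:
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: [] for node in range(node_count)}
    for index, size in enumerate(blocks):
        for offset in range(replication):
            node = (index + offset) % node_count
            placement[node].append((index, size))
    return placement


blocks = split_into_blocks(150)          # the slide's My Data.txt
placement = place_replicas(blocks, node_count=4)
print(blocks)
for node, held in placement.items():
    print(f"DataNode {node + 1}: {held}")
```

In this sketch the dictionary returned by `place_replicas` plays the role of the Name Node's bookkeeping: it records which node holds which block, while the nodes themselves would hold the data.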
Hello MapReduce!
Have you lost some weight?
3
MapReduce
Input File
Map
<Key, Value> <Key, Value> <Key, Value> …
Shuffle & Sort
<Key, Value> <Key, Value> <Key, Value> …
Reduce
Result
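The Map, Shuffle & Sort, and Reduce stages can be sketched as one small in-memory function. This is an illustration of the data flow only, assuming everything fits on one machine; `map_reduce`, `mapper`, and `reducer` are hypothetical names, not Hadoop's API, and the word-length example is made up for the demonstration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle & sort, and reduce over an in-memory input."""
    # Map: every input record yields zero or more (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))

    # Shuffle & sort: bring all values with the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: collapse each key's list of values into one result.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def word_length_mapper(line):
    # Emit (length of word, 1) for every word in the line.
    return [(len(word), 1) for word in line.split()]


def sum_reducer(key, values):
    return sum(values)


lines = ["big data", "hello hadoop"]
print(map_reduce(lines, word_length_mapper, sum_reducer))
```

On a real cluster the map calls would run in parallel on the nodes that hold the input blocks, and the shuffle would move pairs across the network; here all three stages run in sequence in one process.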
MapReduce
Big Data for Dummies.txt
How many times do the words “Big data” and “Hadoop” show up?
MapReduce
Map:
<Big data, 7> <Hadoop, 4>
<Big data, 9> <Hadoop, 6>
<Big data, 3> <Hadoop, 8>
Shuffle & Sort:
<Big data, 7> <Big data, 9> <Big data, 3> <Hadoop, 4> <Hadoop, 6> <Hadoop, 8>
<Big data, 7, 9, 3> <Hadoop, 4, 6, 8>
Reduce:
<Big data, 19> <Hadoop, 18>
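The numbers on this slide can be checked by hand or with a short script. The sketch below starts from the three partial results shown above (treated here as the output of three map tasks, which is an assumption about how the input was split) and walks through the sort, group, and sum steps; the variable names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

# Partial counts, one inner list per map task (values from the slide).
map_outputs = [
    [("Big data", 7), ("Hadoop", 4)],
    [("Big data", 9), ("Hadoop", 6)],
    [("Big data", 3), ("Hadoop", 8)],
]

# Shuffle & sort: flatten, then sort by key. Python's sort is stable,
# so values keep their original order within each key.
all_pairs = [pair for output in map_outputs for pair in output]
sorted_pairs = sorted(all_pairs, key=itemgetter(0))

# Group: collect the values that share a key into one list.
grouped = {
    key: [value for _, value in group]
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
}

# Reduce: add up each key's values.
totals = {key: sum(values) for key, values in grouped.items()}

print(sorted_pairs)
print(grouped)
print(totals)
```

The three printed structures correspond to the last three lines of the slide: the sorted pairs, the grouped values, and the final totals of 19 and 18.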
Let’s Play MapReduce!
’Coz All Talk and No Play Makes a Session a Dull Affair.
Questions. Comments. Feedback.
See you at the (Data) Lake Next Time.
THANK YOU!