Hadoop
Simple. Scalable.
@markgunnels
Java. Clojure. Ruby.
Cloudera Certified
posscon.org
April 15, 16, and 17
Agenda
Overview
Massively Large Data Sets and the problems therein
Distributed File System
MapReduce
Pig
Overview
Doug Cutting
Genius
Favorite Hadoop Story
New York Times
4 Terabytes of Source Articles.
24 Hours.
5.5 Terabytes of PDFs.
Did it again.
$240.
Infoporn from Yahoo
73 hours
490 TB Shuffling
280 TB Output
4000 Nodes
16 PB Disk Space
32K Cores
64 TB RAM
Hadoop solves...
Analyzing Massively Large Datasets
Two Problems
You have to distribute.
Data Storage
Disk capacity has grown much faster than read speeds. Datasets
won't fit on one disk. Tolerate node failure.
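To make the capacity-versus-read-speed point concrete, here is a rough back-of-the-envelope sketch. The 1 TB capacity and 100 MB/s read speed are illustrative assumptions, not figures from the talk, and `full_scan_seconds` is a hypothetical helper.

```python
# Assumed, illustrative figures (not from the talk).
DISK_CAPACITY_BYTES = 1 * 10**12       # 1 TB drive
READ_SPEED_BYTES_PER_S = 100 * 10**6   # 100 MB/s sustained read


def full_scan_seconds(capacity_bytes, read_speed, num_disks=1):
    """Time to read the whole dataset when it is spread evenly over
    num_disks disks that are all read in parallel."""
    return capacity_bytes / (read_speed * num_disks)


one_disk = full_scan_seconds(DISK_CAPACITY_BYTES, READ_SPEED_BYTES_PER_S)
hundred_disks = full_scan_seconds(
    DISK_CAPACITY_BYTES, READ_SPEED_BYTES_PER_S, num_disks=100
)

print(f"1 disk:    {one_disk / 3600:.1f} hours")
print(f"100 disks: {hundred_disks / 60:.1f} minutes")
```

Under these assumptions, spreading the same data over many disks turns a scan of hours into a scan of minutes, which is the motivation for distributing storage.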
Data Analysis
Combine data from many machines. Tolerate node failure.
How Hadoop solves these problems.
Send Code to Data. Not Data to Code.
Data Storage
HDFS
Name Node. Data Nodes.
Master - Slave Relationship
Shard massive files across multiple machines.
MB, GB, and TB
Tolerant of Node Failure
File blocks replicated across 3 nodes by default.
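A minimal sketch of the sharding and replication idea, not HDFS's actual implementation. The 64 MB block size, the node names, and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy decided by the Name Node.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # assumed block size for this sketch
REPLICATION = 3                 # replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Hypothetical round-robin placement: each block goes to
    `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def replicas_after_failure(placement, failed_node):
    """Drop a failed node and report which replicas are still readable."""
    return {
        block: [n for n in holders if n != failed_node]
        for block, holders in placement.items()
    }


nodes = ["node1", "node2", "node3", "node4", "node5"]  # made-up names
blocks = split_into_blocks(200 * 1024 * 1024)           # a 200 MB file
placement = place_replicas(len(blocks), nodes)
survivors = replicas_after_failure(placement, "node2")

for block, holders in survivors.items():
    print(block, blocks[block], holders)
```

With three copies of every block on distinct nodes, losing any single node still leaves at least two readable copies of each block.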
HDFS behaves like a normal file system.
No true appends yet.
Demonstration.
Data Analysis
MapReduce
JobTracker. TaskTrackers.
Master - Slave Relationship.
map
Demonstration
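The live demonstration is not part of the transcript. As a stand-in, here is a small Python sketch of `map`: apply one function to every item independently. The sample lines and `count_words` are made up for illustration.

```python
lines = ["the quick brown fox", "jumps over", "the lazy dog"]


def count_words(line):
    """Number of whitespace-separated words in one line."""
    return len(line.split())


# map applies count_words to each line; no call depends on another.
counts = list(map(count_words, lines))
print(counts)  # [4, 2, 3]
```

Because each call is independent, the work can be handed to different machines without coordination.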
pmap
Demonstration
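The demonstration itself is not in the transcript. A rough Python analogue of `pmap`, assuming a thread pool is an acceptable stand-in: the same function is applied to every item, but the calls run concurrently. The `pmap` helper below is a hypothetical wrapper, and unlike Clojure's `pmap` it is eager rather than lazy.

```python
from concurrent.futures import ThreadPoolExecutor


def pmap(fn, items, max_workers=4):
    """Apply fn to every item using a pool of worker threads.
    Results come back in the same order as the inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))


def count_words(line):
    """Number of whitespace-separated words in one line."""
    return len(line.split())


lines = ["the quick brown fox", "jumps over", "the lazy dog"]
parallel_counts = pmap(count_words, lines)
print(parallel_counts)  # [4, 2, 3]
```

The result matches a plain `map`; only the scheduling of the calls changes.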
reduce
Demonstration
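The demonstration is not in the transcript. As a stand-in, a small Python sketch of `reduce`: fold a sequence of values into a single result with a two-argument function. The input numbers are made up.

```python
from functools import reduce


def add(total, n):
    """Combine the running total with the next value."""
    return total + n


word_counts = [4, 2, 3]
total_words = reduce(add, word_counts, 0)  # ((0 + 4) + 2) + 3
print(total_words)  # 9
```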
(reduce (pmap))
Demonstration.
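The demonstration is not in the transcript. A minimal Python sketch of the combined pattern, assuming a word count as the example problem: count words in each chunk in parallel, then reduce the partial counts into one result. The chunks and helper names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(chunk):
    """Map step: word frequencies for one chunk of text."""
    return Counter(chunk.split())


def merge(left, right):
    """Reduce step: combine two partial frequency tables."""
    return left + right


chunks = ["the quick brown fox", "the lazy dog", "the fox"]

# Parallel map over the chunks, then a reduce over the partial results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_words, chunks))

totals = reduce(merge, partials, Counter())
print(totals.most_common(2))  # [('the', 3), ('fox', 2)]
```

This runs on one machine with threads; MapReduce applies the same shape across many machines, with the framework handling distribution and failure.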
MapReduce
Java
Nobody likes it.
:-)
MapReduce
Ruby. Python. Unix Utilities.
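Hadoop Streaming lets any program act as a mapper or reducer by reading lines on standard input and writing tab-separated key/value lines to standard output. The sketch below shows that contract in Python under stated assumptions: the functions take iterables of lines rather than reading `sys.stdin`, and `run_locally` is a hypothetical helper that imitates the sort between the two phases on a single machine.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word. Input must be sorted by key,
    so all lines for one word arrive together."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate map -> sort -> reduce without a cluster."""
    return list(reducer(sorted(mapper(lines))))


result = run_locally(["the quick fox", "the lazy dog"])
for line in result:
    print(line)
```

On a cluster, each function would live in its own script that loops over `sys.stdin` and prints its output; the same line-oriented contract is why plain Unix utilities can fill either role.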
MapReduce
Clojure
Hadoop Ecosystem
Pig. ZooKeeper. Hive. Cascading.
Pig
HBase