Hadoop Simple. Scalable.

Page 1: Hadoop - Simple. Scalable

Hadoop

Simple. Scalable.

Page 2: Hadoop - Simple. Scalable

@markgunnels

[email protected]

Page 3: Hadoop - Simple. Scalable

Java. Clojure. Ruby.

Cloudera Certified

Page 4: Hadoop - Simple. Scalable

posscon.org

April 15, 16, and 17

Page 5: Hadoop - Simple. Scalable

Agenda

Overview

Massively Large Data Sets and the problems therein

Distributed File System

MapReduce

Pig

Page 6: Hadoop - Simple. Scalable

Overview

Page 7: Hadoop - Simple. Scalable

Doug Cutting

Genius

Page 8: Hadoop - Simple. Scalable

Favorite Hadoop Story

New York Times

Page 9: Hadoop - Simple. Scalable

4 Terabytes of Source Articles.

Page 10: Hadoop - Simple. Scalable

24 Hours.

Page 11: Hadoop - Simple. Scalable

5.5 Terabytes of PDFs.

Page 12: Hadoop - Simple. Scalable

Did it again.

Page 13: Hadoop - Simple. Scalable

$240.

Page 14: Hadoop - Simple. Scalable

Infoporn from Yahoo

73 hours

490 TB Shuffling

280 TB Output

4000 Nodes

16 PB Disk Space

32K Cores

64 TB RAM
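
Dividing the cluster totals above by the node count gives a rough feel for a single machine. This is only a back-of-the-envelope sketch: it assumes resources are spread evenly over the 4000 nodes, uses decimal units, and reads "32K" as 32,000. None of those assumptions comes from the slide.

```python
# Cluster totals taken from the slide.
NODES = 4000
DISK_PB = 16
CORES = 32_000   # "32K" read as 32,000 (assumption)
RAM_TB = 64

# Even distribution across nodes is an assumption, not a stated fact.
disk_tb_per_node = DISK_PB * 1000 / NODES   # PB -> TB, decimal units
cores_per_node = CORES / NODES
ram_gb_per_node = RAM_TB * 1000 / NODES     # TB -> GB, decimal units

print(disk_tb_per_node, cores_per_node, ram_gb_per_node)
```

Under those assumptions each node works out to roughly commodity-server scale rather than anything exotic.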

Page 15: Hadoop - Simple. Scalable

Hadoop solves...

Page 16: Hadoop - Simple. Scalable

Analyzing Massively Large Datasets

Page 17: Hadoop - Simple. Scalable

Two Problems

You have to distribute.

Page 18: Hadoop - Simple. Scalable

Data Storage

Disk capacity has grown far faster than read speeds. Datasets won't fit on one disk. Tolerate node failure.
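
The gap between capacity and read speed is easy to see with some rough arithmetic. The figures below (a 1 TB drive, 100 MB/s sustained reads) are illustrative assumptions, not numbers from the talk.

```python
# Illustrative figures only (assumptions, not from the talk).
DATASET_GB = 1000        # 1 TB of data
READ_SPEED_MB_S = 100    # sustained sequential read speed of one drive

def full_scan_seconds(total_gb, drives):
    """Seconds to read total_gb spread evenly over `drives` disks read in parallel."""
    per_drive_mb = total_gb * 1000 / drives
    return per_drive_mb / READ_SPEED_MB_S

one_drive = full_scan_seconds(DATASET_GB, drives=1)
hundred_drives = full_scan_seconds(DATASET_GB, drives=100)
print(f"1 drive: {one_drive / 3600:.1f} h, 100 drives: {hundred_drives:.0f} s")
```

Spreading the same data over many drives and reading them at once is what turns hours into minutes, which is the motivation for distributing storage in the first place.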

Page 19: Hadoop - Simple. Scalable

Data Analysis

Combine data from many machines. Tolerate node failure.

Page 20: Hadoop - Simple. Scalable

How Hadoop solves these problems.

Page 21: Hadoop - Simple. Scalable

Send Code to Data. Not Data to Code.

Page 22: Hadoop - Simple. Scalable

Data Storage

HDFS

Page 23: Hadoop - Simple. Scalable

Name Node. Data Nodes.

Master - Slave Relationship

Page 24: Hadoop - Simple. Scalable

Shard massive files across multiple machines.

MB, GB, and TB
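
Sharding a file comes down to cutting it into fixed-size blocks that can live on different machines. A minimal sketch follows; the function name `split_into_blocks` is hypothetical and the 64 MB block size is an assumption, since the slide does not give one.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    offset = 0
    while offset < file_size:
        # The final block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

MB = 1024 * 1024
BLOCK_SIZE = 64 * MB   # assumed block size, not stated on the slide

blocks = split_into_blocks(200 * MB, BLOCK_SIZE)
print(len(blocks), blocks[-1][1] // MB)
```

Each block can then be stored, replicated, and processed independently of the others.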

Page 25: Hadoop - Simple. Scalable

Tolerant of Node Failure

Blocks replicated across 3 nodes by default.
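
To see why replication tolerates node failure, here is a toy placement sketch. It uses plain round-robin, which is a deliberate simplification: real HDFS placement also takes racks into account. The helper names `place_replicas` and `readable_blocks` and the node names are hypothetical.

```python
import itertools

def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    ring = itertools.cycle(nodes)
    # Consecutive picks from the ring are distinct because replication <= len(nodes).
    return {block: [next(ring) for _ in range(replication)] for block in block_ids}

def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items() if set(holders) - set(failed)]

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(["blk_0", "blk_1", "blk_2"], nodes)
survivors = readable_blocks(placement, failed={"node1", "node2"})
print(placement["blk_0"], survivors)
```

With three copies of every block, losing any two nodes in this example still leaves every block readable.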

Page 26: Hadoop - Simple. Scalable

HDFS behaves like a normal file system.

No true appends yet.

Page 27: Hadoop - Simple. Scalable

Demonstration.

Page 28: Hadoop - Simple. Scalable

Data Analysis

MapReduce

Page 29: Hadoop - Simple. Scalable

JobTracker. TaskTrackers.

Master - Slave Relationship.

Page 30: Hadoop - Simple. Scalable

map

Page 31: Hadoop - Simple. Scalable

Demonstration
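
The live demo is not part of the slides. As a stand-in, here is what `map` does, shown with Python's built-in rather than whatever the speaker used on stage: it applies one function to every element and keeps the order.

```python
words = ["hadoop", "pig", "hive"]

# Apply len to every element; results come back in input order.
lengths = list(map(len, words))
print(lengths)
```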

Page 32: Hadoop - Simple. Scalable

pmap

Page 33: Hadoop - Simple. Scalable

Demonstration
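
The demo itself is not in the slides. A rough Python analogue of a parallel map can be sketched with a thread pool; the `pmap` helper below is hypothetical and, unlike a lazy parallel map, it collects all results eagerly.

```python
from concurrent.futures import ThreadPoolExecutor

def pmap(fn, items, workers=4):
    """Apply fn to each item on a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def count_words(line):
    return len(line.split())

lines = ["send code to data", "not data to code", "simple scalable"]
counts = pmap(count_words, lines)
print(counts)
```

The key property is that each call to `count_words` is independent of the others, so the work can be spread over threads, processes, or machines without changing the result.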

Page 34: Hadoop - Simple. Scalable

reduce

Page 35: Hadoop - Simple. Scalable

Demonstration
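
The demo is not in the slides. As a stand-in, this is `reduce` in Python: it folds a sequence into a single value by applying a two-argument function repeatedly.

```python
from functools import reduce
from operator import add

counts = [4, 4, 2]

# ((0 + 4) + 4) + 2
total = reduce(add, counts, 0)
print(total)
```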

Page 36: Hadoop - Simple. Scalable

(reduce (pmap))

Page 37: Hadoop - Simple. Scalable

Demonstration.
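
The demo is not in the slides. Putting the two ideas together gives the shape of a MapReduce job: a parallel map produces partial results and a reduce merges them. The sketch below is a single-machine illustration using a thread pool, not Hadoop's API; the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_line(line):
    """Map step: one line in, a Counter of its words out."""
    return Counter(line.lower().split())

def merge(left, right):
    """Reduce step: combine two partial counts."""
    return left + right

lines = ["send code to data", "not data to code"]

# Parallel map: each line is counted independently.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_line, lines))

# Reduce: fold the partial counts into one result.
totals = reduce(merge, partials, Counter())
print(totals["data"], totals["code"], totals["send"])
```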

Page 38: Hadoop - Simple. Scalable

MapReduce

Java

Page 39: Hadoop - Simple. Scalable

Nobody likes it.

:-)

Page 40: Hadoop - Simple. Scalable

MapReduce

Ruby. Python. Unix Utilities.
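
Scripting languages and Unix utilities fit in because a streaming-style job only needs programs that read lines and write tab-separated key/value lines, with the framework sorting by key in between. A minimal word-count sketch in Python follows. The function names are hypothetical, and `run_local` is a local stand-in for the framework's sort step, not part of Hadoop.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_local(lines):
    """Local stand-in for the framework: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))

result = run_local(["Pig Hive pig", "hive HBase"])
print(result)
```

In a real streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output; keeping them as generator functions here just makes the logic easy to try locally.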

Page 41: Hadoop - Simple. Scalable

MapReduce

Clojure

Page 42: Hadoop - Simple. Scalable

Hadoop Ecosystem

Pig. ZooKeeper. Hive. Cascading.

Page 43: Hadoop - Simple. Scalable

Pig

Page 44: Hadoop - Simple. Scalable

HBase