A Basic Introduction to the Hadoop Eco-System

Sameer Tiwari, Hadoop Architect, Pivotal Inc.

[email protected], @sameertech

Break it down

• Raw Storage - HDFS

• Columnar Store - HBase

• Query engines - Hive, Pig

• Schedulers - Map-Reduce, YARN

• Streaming - Flume

• Machine Learning - Mahout

• Workflow - Oozie

• Distributed Locking - Zookeeper

Break it down

[Figure: stack diagram with the labels Unix OS and File System, HDFS, HDFS API, Map Reduce / YARN, Pig, Hive, Oozie, Mahout, HBase, Zookeeper, Flume, and Sqoop.]

Hadoop Distributed File System (HDFS)

• History

o Based on Google File System Paper (2003)

o Built at Yahoo by a small team

• Goals

o Tolerance to hardware failure

o Sequential access as opposed to random access

o High aggregate throughput for large data sets

o “Write Once Read Many” paradigm

HDFS - Key Components

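[Figure: HDFS architecture. Client1 writes FileA and Client2 writes FileB. File.create() and other metadata operations (NN ops) go to the NameNode; File.write() and other data block operations (DN ops) go to the DataNodes. DataNode 1 to DataNode 4 are spread across Rack 1 and Rack 2, and each block is copied from DataNode to DataNode by replication pipelining.]

NameNode metadata shown in the figure:

FileA: Metadata e.g. Size, Owner...

o AB1: D1, D3, D4

o AB2: D1, D3, D4

FileB: Metadata e.g. Size, Owner...

o BB1: D1, D2, D4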
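The block-to-DataNode mapping on this slide can be sketched as a small in-memory model. This is only an illustration of the idea, assuming the placements listed for FileA and FileB; the names `BLOCK_MAP`, `surviving_replicas`, and `lost_blocks` are hypothetical and are not part of the HDFS API.

```python
# Toy model of the NameNode's block map from the slide (not the real HDFS API).
# Each file maps to its blocks, and each block to the DataNodes holding a replica.
BLOCK_MAP = {
    "FileA": {"AB1": ["D1", "D3", "D4"], "AB2": ["D1", "D3", "D4"]},
    "FileB": {"BB1": ["D1", "D2", "D4"]},
}


def surviving_replicas(block_map, failed):
    """Return {block_id: [DataNodes still holding it]} after `failed` nodes go down."""
    failed = set(failed)
    return {
        block: [dn for dn in nodes if dn not in failed]
        for blocks in block_map.values()
        for block, nodes in blocks.items()
    }


def lost_blocks(block_map, failed):
    """Return the sorted ids of blocks that have no replica left."""
    remaining = surviving_replicas(block_map, failed)
    return sorted(block for block, nodes in remaining.items() if not nodes)


if __name__ == "__main__":
    # Losing one DataNode leaves two replicas of every block.
    print(surviving_replicas(BLOCK_MAP, ["D1"]))
    # A block is lost only when every DataNode holding it fails.
    print(lost_blocks(BLOCK_MAP, ["D1", "D3", "D4"]))
```

With three replicas per block, any single DataNode failure still leaves every block readable, which is the "tolerance to hardware failure" goal in miniature.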

Map Reduce

[Figure: data flow, Input → Mappers → Shuffle/Sort → Reducers → Output]

map(key1, value1) -> list<key2, value2>

reduce(key2, list<value2>) -> list<value3>
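The two signatures above can be made concrete with a word count, run in a single process. This is a minimal sketch of the programming model only, not of Hadoop's Java API; `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def map_fn(_line_no, line):
    """map(key1, value1) -> list<key2, value2>: emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(_word, counts):
    """reduce(key2, list<value2>) -> list<value3>: sum the counts for one word."""
    return [sum(counts)]


def run_job(lines):
    """Run map, shuffle/sort, and reduce sequentially over a list of lines."""
    # Map phase: each input record is (line number, line text).
    intermediate = []
    for line_no, line in enumerate(lines):
        intermediate.extend(map_fn(line_no, line))

    # Shuffle/sort phase: group all values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per key, visited in sorted key order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


if __name__ == "__main__":
    print(run_job(["big data", "big cluster"]))
    # prints {'big': [2], 'cluster': [1], 'data': [1]}
```

On a real cluster the map and reduce calls run in parallel on different machines, and the shuffle/sort step moves the intermediate pairs across the network.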

Map Reduce

[Figure: MapReduce job flow. A Client, the Job Tracker, two Task Trackers each running a Task, and HDFS, connected by arrows numbered 1 to 6.]

JT = Job Tracker, TT = Task Tracker

1. Client submits job to JT

2. JT responds with jobid

3. JobClient copies job resources to HDFS

4. Submit job to JT

5. TT heartbeat to JT gets the task

6. TT gets the task from HDFS

7. Execute task, Map or Reduce
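The pull-based hand-off in steps 2 to 7 can be sketched as below. This is a simplified single-process illustration, assuming HDFS can be stood in for by a dictionary; the classes `JobTracker` and `TaskTracker` and their methods are hypothetical and do not mirror Hadoop's real classes.

```python
from collections import deque


class JobTracker:
    """Hands out job ids and queues tasks until a Task Tracker asks for one."""

    def __init__(self):
        self._next_id = 1
        self._pending = deque()

    def new_job_id(self):
        # Step 2: respond to the client with a job id.
        job_id = f"job_{self._next_id:04d}"
        self._next_id += 1
        return job_id

    def submit(self, job_id, tasks):
        # Step 4: the job is submitted; its tasks wait for a heartbeat.
        for task in tasks:
            self._pending.append((job_id, task))

    def heartbeat(self, _tracker_name):
        # Step 5: a heartbeat is answered with the next task, if any.
        return self._pending.popleft() if self._pending else None


class TaskTracker:
    """Polls the Job Tracker and runs whatever task it is given."""

    def __init__(self, name, job_tracker, hdfs):
        self.name = name
        self.job_tracker = job_tracker
        self.hdfs = hdfs
        self.completed = []

    def tick(self):
        assignment = self.job_tracker.heartbeat(self.name)
        if assignment is None:
            return False
        job_id, task = assignment
        resources = self.hdfs[job_id]  # Step 6: fetch job resources from "HDFS".
        self.completed.append((task, resources))  # Step 7: "execute" the task.
        return True


def run_demo():
    hdfs = {}  # stand-in for HDFS: job id -> job resources
    jt = JobTracker()
    job_id = jt.new_job_id()
    hdfs[job_id] = "job.jar"  # Step 3: copy job resources to HDFS.
    jt.submit(job_id, ["map-0", "map-1", "reduce-0"])
    trackers = [TaskTracker("tt1", jt, hdfs), TaskTracker("tt2", jt, hdfs)]
    # Every tracker sends a heartbeat each round until no work is left.
    while any([tt.tick() for tt in trackers]):
        pass
    return {tt.name: [task for task, _ in tt.completed] for tt in trackers}


if __name__ == "__main__":
    print(run_demo())
```

The point of the design is that the Job Tracker never pushes work: Task Trackers ask for it, so a tracker that stops sending heartbeats simply stops receiving tasks.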

YARN

[Figure: YARN architecture. A Client, the Resource Manager, and two NodeManagers, each hosting an AppMaster and a Container, connected by arrows numbered 1 to 8.]

Notes on previous YARN slide

1. A client program submits the application, including the necessary specifications to launch the application-specific ApplicationMaster itself.

2. The ResourceManager assumes the responsibility to negotiate a specified container in which to start the ApplicationMaster and then launches the ApplicationMaster.

3. The ApplicationMaster, on boot-up, registers with the ResourceManager. The registration allows the client program to query the ResourceManager for details, which allow it to directly communicate with its own ApplicationMaster.

4. During normal operation the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol.

5. On successful container allocations, the ApplicationMaster launches the container by providing the container launch specification to the NodeManager. The launch specification typically includes the necessary information to allow the container to communicate with the ApplicationMaster itself.

6. The application code executing within the container then provides necessary information (progress, status, etc.) to its ApplicationMaster via an application-specific protocol.

7. During the application execution, the client that submitted the program communicates directly with the ApplicationMaster to get status, progress updates, etc. via an application-specific protocol.

8. Once the application is complete, and all necessary work has been finished, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
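The eight steps in these notes can be traced with a small synchronous simulation. It is a sketch under strong assumptions: containers are just a counter, NodeManagers are not modelled, and every class and method name here (`ResourceManager`, `ApplicationMaster`, `submit_application`, and so on) is hypothetical rather than the YARN client API.

```python
class ResourceManager:
    """Tracks free containers and the ApplicationMasters registered with it."""

    def __init__(self, free_containers):
        self.free = free_containers
        self.registered = {}
        self.log = []

    def submit_application(self, app_id, num_workers):
        self.log.append(f"1 submit {app_id}")  # step 1: client submits
        self._take(1)  # step 2: a container for the ApplicationMaster
        self.log.append(f"2 launch AppMaster for {app_id}")
        app_master = ApplicationMaster(app_id, self, num_workers)
        app_master.start()
        return app_master

    def register(self, app_master):
        self.registered[app_master.app_id] = app_master  # step 3
        self.log.append(f"3 register {app_master.app_id}")

    def allocate(self, app_id, count):
        granted = min(count, self.free)  # step 4: resource request
        self._take(granted)
        self.log.append(f"4 allocate {granted} container(s) to {app_id}")
        return granted

    def deregister(self, app_master, released):
        del self.registered[app_master.app_id]  # step 8
        self.free += released
        self.log.append(f"8 deregister {app_master.app_id}")

    def _take(self, count):
        if count > self.free:
            raise RuntimeError("no free containers")
        self.free -= count


class ApplicationMaster:
    """Per-application coordinator that requests and tracks worker containers."""

    def __init__(self, app_id, rm, num_workers):
        self.app_id = app_id
        self.rm = rm
        self.num_workers = num_workers
        self.granted = 0
        self.progress = []

    def start(self):
        self.rm.register(self)
        self.granted = self.rm.allocate(self.app_id, self.num_workers)
        for index in range(self.granted):
            self.rm.log.append(f"5 launch container {index}")  # step 5
            self.progress.append((index, "done"))  # step 6: container reports
            self.rm.log.append(f"6 container {index} reports done")

    def status(self):
        # Step 7: the client asks the ApplicationMaster directly.
        return f"{len(self.progress)}/{self.num_workers} containers done"

    def finish(self):
        # Step 8: release the worker containers plus the AppMaster's own.
        self.rm.deregister(self, released=self.granted + 1)


if __name__ == "__main__":
    rm = ResourceManager(free_containers=4)
    am = rm.submit_application("app_1", num_workers=2)
    print(am.status())
    am.finish()
    print("\n".join(rm.log))
```

Note how the ResourceManager only deals in containers and registrations; progress tracking lives entirely in the ApplicationMaster, which is the main difference from the Job Tracker model on the earlier slide.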

Flume

http://flume.apache.org/

HBase

• History

o Based on Google’s Bigtable (2006)

o Built at Powerset (later acquired by Microsoft)

o Facebook and Yahoo use it extensively (~1000 machines)

• Goals

o Random R/W access

o Tables with billions of rows × millions of columns

o Often referred to as a “NoSQL” data store

o High-speed ingest rate: Facebook handles roughly a billion messages and chats per day

o Good consistency model
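The goals above (random reads and writes over very wide, mostly empty tables) suggest a sparse, sorted map keyed by row and column. The sketch below illustrates that idea only, assuming "family:qualifier" strings for column names; `SparseTable` and its methods are hypothetical and are not the HBase client API.

```python
from bisect import insort


class SparseTable:
    """Rows kept in key order; each row stores only the columns it actually has."""

    def __init__(self):
        self._rows = {}          # row key -> {column name: value}
        self._sorted_keys = []   # row keys in sorted order, for range scans

    def put(self, row_key, column, value):
        """Random write: set one cell, creating the row if needed."""
        if row_key not in self._rows:
            self._rows[row_key] = {}
            insort(self._sorted_keys, row_key)
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Random read: return one cell, or None if it was never written."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start_row, stop_row):
        """Yield (row key, columns) for start_row <= row key < stop_row, in key order."""
        for key in self._sorted_keys:
            if start_row <= key < stop_row:
                yield key, dict(self._rows[key])


if __name__ == "__main__":
    table = SparseTable()
    table.put("user2", "msg:body", "hi")
    table.put("user1", "msg:body", "hello")
    table.put("user1", "info:name", "Ann")
    print(table.get("user1", "info:name"))
    print(list(table.scan("user1", "user3")))
```

Because absent cells take no space, a table can have millions of possible columns while each row stores only a handful of them.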

HBase - Key Components

[Figure: HBase deployment. Master(s), active and backup: NameNode, JobTracker, HMaster. Slaves, many: DataNode, TaskTracker, HRegionServer. Also shown: a ZK cluster and a Client.]

• Google BigTable on GFS == HBase on HDFS

• Generally co-located with HDFS

• Depends on HDFS for storing its data

• Follows a Master Slave model

• Depends on a ZK quorum for Master election

Mahout

• Parallel Machine Learning and Data mining library

• Core groups of algorithms

o Recommendation - Netflix, Pandora

o Classification - “look-alike”, pattern recognition

o Clustering - Marketing and Sales

• Uses Map Reduce under the covers

Hive and Pig

• Higher level languages for using MapReduce

• Hive

o Convenience of storing data in Tables with schemas

o Has a SQL “like” language called HiveQL

o Builds a simple optimized execution plan

• Pig

o Scripting language interface

o Used for ETL

o Schemas can be used with HCatalog

Additional Components

• HCatalog for Pig and Map-Reduce

• Workflow - Oozie

• Distributed Locking - Zookeeper

• Spark and Shark from UC Berkeley

Questions?
