Hadoop Fundamentals
Satish Mittal, InMobi

This deck covers the basic concepts behind the Hadoop Distributed File System (HDFS) and the Hadoop Map-Reduce framework.
Why Hadoop?
Big Data
• Sources: server logs, clickstream, machine, sensor, social…
• Use-cases: batch / interactive / real-time
What is needed from a Distributed Platform?
• Scalable
  o Petabytes of data
• Economical
  o Use commodity hardware
  o Share clusters among many applications
• Reliable
  o Failure is common when you run thousands of machines; handle it well in the SW layer
• Simple programming model
  o Applications must be simple to write and maintain
What is Hadoop?
• Petabyte-scale distributed data storage and data processing infrastructure
• Based on the Google GFS & MapReduce papers
• Contributed mostly by Yahoo! in the initial years; now has a much more widespread developer and user base
• 1000s of nodes, PBs of data in storage
Hadoop Basics
• Cheap JBODs for storage
• Move processing to where the data is: location awareness (topology)
• Assume hardware failures to be the norm
• Map & Reduce primitives are fairly simple yet powerful; most set operations can be performed using these primitives
• Isolation
Hadoop Distributed File System (HDFS)
Goals:
• Fault tolerant, scalable, distributed storage system
• Designed to reliably store very large files across machines in a large cluster
Assumptions:
• Files are written once and read several times
• Applications perform large sequential streaming reads
• Not a Unix-like, POSIX file system
Access is via the command line or the Java API (a minimal read sketch follows).
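To make the Java API route concrete, here is a minimal sketch of reading a file from HDFS. The path is hypothetical, and the configuration is assumed to be picked up from core-site.xml on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Reads the cluster settings (filesystem URI etc.) from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; replace with a real HDFS path
        Path path = new Path("/user/demo/mydata/foo");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}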
HDFS – Data Model
• Data is organized into files and directories
• Files are divided into uniform-sized blocks and distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• The filesystem keeps checksums of data for corruption detection and recovery
• HDFS exposes block placement so that computation can be migrated to the data
HDFS – Architecture

Namenode
• The Namenode is a SPOF (HA for the Namenode is now available in the 2.0 alpha)
• Responsible for managing the list of all active datanodes and the filesystem namespace (files, directories, blocks and their locations)
• Block placement policy
• Ensuring adequate replicas
• Writing edit logs durably
Datanode
• Service that allows data to be streamed in and out
• The block is the unit of data that a datanode understands
• Sends block reports to the Namenode periodically
• Checksum verification and disk usage stats are managed by the datanode
• Clients talk to datanodes for the actual data
• As long as at least one datanode holding a file's blocks is available, datanode failures can be tolerated, albeit at lower performance
HDFS – Write pipeline
[Diagram: the DFS client asks the Namenode to create the file and get block locations (1) and receives datanodes DN 1, 2 & 3 (2); the client streams the file to Data node 1, which pipelines it to Data node 2 and Data node 3 (spanning Rack 1 and Rack 2); acks flow back along the pipeline (5a, 4a, 3a), and the client finally marks the file complete at the Namenode (3b).]
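From the client's side this whole pipeline is hidden behind an output stream. A minimal sketch using the Java API (the destination path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical destination path
        Path path = new Path("/user/demo/mydata/out.txt");

        // create() asks the Namenode for block allocations; the replication
        // pipeline across datanodes is handled transparently by the client library
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("hello, hdfs\n");
        }
    }
}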
HDFS – Block placement
• Default is 3 replicas, but configurable (see the sketch below)
• Blocks are placed (and writes pipelined): first on the writer's node, then on a node in a different rack, then on another node in that other rack
• Clients read from the closest replica
• If the replication for a block drops below the target, it is automatically re-replicated
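Since the replica count is per-file, it can also be changed after the fact. A small sketch using FileSystem.setReplication (the path is hypothetical; the call returns false if the file does not exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Raise the replication target of one file from the default (3) to 5;
        // the Namenode then schedules the extra replicas in the background
        boolean ok = fs.setReplication(new Path("/user/demo/mydata/foo"), (short) 5);
        System.out.println("Replication change accepted: " + ok);
    }
}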
HDFS – Data correctness
• Data is checked with CRC32
• File creation:
  ‣ Client computes a checksum per block
  ‣ DataNode stores the checksum
• File access:
  ‣ Client retrieves the data and checksum from the DataNode
  ‣ If validation fails, the client tries other replicas
Interacting with HDFS
Simple commands
• hadoop fs -ls, -du, -rm, -rmr, -chown, -chmod
Uploading files
• hadoop fs -put foo mydata/foo
• cat ReallyBigFile | hadoop fs -put - mydata/ReallyBigFile
Downloading files
• hadoop fs -get mydata/foo foo
• hadoop fs -cat mydata/ReallyBigFile | grep "the answer is"
• hadoop fs -cat mydata/foo
Admin
• hadoop dfsadmin -report
• hadoop fsck /
Map-Reduce
Map-Reduce Application
Say we have 100s of machines available to us. How do we write applications that run on them?
As an example, consider the problem of creating an index for search:
‣ Input: hundreds of documents
‣ Output: a mapping of word to document IDs
‣ Resources: a few machines
The problem: Inverted Index
Input: documents, e.g. "Farmer1 has the following animals: bees, cows, goats." plus some other documents about animals…
Desired output, a word ➝ doc-ID index:
Animals: 1, 2, 3, 4, 12
Bees: 1, 2, 23, 34
Dog: 3, 9
Farmer1: 1, 7
…
Building an inverted index
[Diagram: Machines 1–3 each scan their local documents and emit partial postings (Machine1: Animals: 1, 3 and Dog: 3; Machine2: Animals: 2, 12 and Bees: 23; Machine3: Dog: 9 and Farmer1: 7). The postings are then routed by word to Machines 4 and 5, which merge them into the final lists (Machine4: Animals: 1, 2, 3, 12 and Bees: 23; Machine5: Dog: 3, 9 and Farmer1: 7).]
This is Map-Reduce
In our example:
‣ Map: (doc-num, text) ➝ [(word, doc-num)]
‣ Reduce: (word, [doc1, doc3, ...]) ➝ [(word, "doc1, doc3, …")]
General form:
‣ Two functions: Map and Reduce
‣ Operate on key and value pairs
‣ Map: (K1, V1) ➝ list(K2, V2)
‣ Reduce: (K2, list(V2)) ➝ (K3, V3)
‣ Primitives present in Lisp and other functional languages
Same principle extended to distributed computing:
‣ Map and Reduce tasks run on distributed sets of machines (see the Java sketch below)
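A sketch of those two functions for the inverted-index example, written against Hadoop's org.apache.hadoop.mapreduce API. The types are assumptions for illustration; in particular, it assumes an input format that presents the document ID as the record key:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Map: (doc-num, text) -> [(word, doc-num)]
    public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text docId, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word.toLowerCase()), docId);
                }
            }
        }
    }

    // Reduce: (word, [doc1, doc3, ...]) -> (word, "doc1, doc3, ...")
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text docId : docIds) {
                if (postings.length() > 0) {
                    postings.append(", ");
                }
                postings.append(docId.toString());
            }
            context.write(word, new Text(postings.toString()));
        }
    }
}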
Map-Reduce Framework
Abstracts functionality common to all Map-Reduce applications:
‣ Distributes tasks to multiple machines
‣ Sorts, transfers and merges intermediate data from all machines between the Map phase and the Reduce phase
‣ Monitors task progress
‣ Handles faulty machines and faulty tasks transparently
Provides pluggable APIs and configuration mechanisms for writing applications:
‣ Map and Reduce functions
‣ Input formats and splits
‣ Number of tasks, data types, etc.
Provides status about jobs to users
MR – Architecture
[Diagram: the Job Client submits the job to the JobTracker; each TaskTracker sends periodic heartbeats to the JobTracker and receives task assignments in return; intermediate map output is shuffled between TaskTrackers; progress is reported back to the Job Client; job input and output live in HDFS, accessed through DFS clients.]
Map-Reduce
• All user code runs in isolated JVMs
• The client computes the input splits
• The JobTracker just schedules these splits (one mapper per split)
• The Mapper, Reducer, Partitioner, Combiner and any custom Input/OutputFormat run in the user JVM
• Idempotence
Hadoop – Two services in one
Hadoop HDFS + MR cluster: machines run both Datanodes and TaskTrackers.
[Diagram: worker machines each host a Datanode (D) and a TaskTracker (T); the Namenode and JobTracker run as the two masters. A client submits jobs to the JobTracker, gets block locations from the Namenode, and monitors progress through an HTTP monitoring UI.]

WordCount: Hello World of Hadoop
• Input: a bunch of large text files
• Desired output: frequencies of words
Word Count Example
Mapper
‣ Input: value: lines of input text
‣ Output: key: word, value: 1
Reducer
‣ Input: key: word, value: set of counts
‣ Output: key: word, value: sum
Launching program
‣ Defines the job
‣ Submits the job to the cluster
A complete sketch follows.
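A minimal, self-contained sketch of this example against the org.apache.hadoop.mapreduce API (class names and paths are illustrative; args[0] and args[1] are the input and output directories):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for each line of text, emit (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Launching program: defines the job and submits it to the cluster
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}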
Questions?
Thank You!
mailto: [email protected]