Hadoop Fundamentals
Satish Mittal, InMobi



This deck covers the basic concepts behind the Hadoop Distributed File System (HDFS) and the Hadoop Map-Reduce framework.


Page 1: Hadoop Fundamentals

Hadoop Fundamentals

Satish Mittal, InMobi

Page 2: Hadoop Fundamentals

Why Hadoop?

Page 3: Hadoop Fundamentals

Big Data

• Sources: server logs, clickstream, machine, sensor, social…

• Use-cases: batch, interactive, real-time

Page 4: Hadoop Fundamentals

• Scalable: petabytes of data

• Economical: use commodity hardware; share clusters among many applications

• Reliable: failure is common when you run thousands of machines; handle it well in the software layer

• Simple programming model: applications must be simple to write and maintain

What is needed from a Distributed Platform?

Page 5: Hadoop Fundamentals

Hadoop is a petabyte-scale distributed data storage and data processing infrastructure
• Based on the Google GFS and MapReduce papers
• Contributed mostly by Yahoo! in the initial years; now has a more widespread developer and user base
• 1000s of nodes, PBs of data in storage

What is Hadoop?

Page 6: Hadoop Fundamentals

• Cheap JBODs for storage

• Move processing to where the data is: location awareness (topology)

• Assume hardware failures to be the norm

• Map & Reduce primitives are fairly simple yet powerful; most set operations can be performed using these primitives

• Isolation

Hadoop Basics

Page 7: Hadoop Fundamentals

Hadoop Distributed File System (HDFS)

Page 8: Hadoop Fundamentals

Goals:
‣ Fault tolerant, scalable, distributed storage system
‣ Designed to reliably store very large files across machines in a large cluster

Assumptions:
‣ Files are written once and read several times
‣ Applications perform large sequential streaming reads
‣ Not a Unix-like, POSIX file system

Access via command line or Java API

HDFS
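The Java API mentioned above looks roughly like this; a minimal sketch of streaming a file out of HDFS, assuming a default Configuration picked up from the cluster's config files and a hypothetical path:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // handle to the default (HDFS) filesystem
    Path path = new Path("/user/demo/mydata/foo");     // hypothetical file path
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);                      // large sequential streaming read
      }
    }
  }
}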

Page 9: Hadoop Fundamentals

• Data is organized into files and directories
• Files are divided into uniform-sized blocks and distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• The filesystem keeps checksums of data for corruption detection and recovery
• HDFS exposes block placement so that computation can be migrated to the data

HDFS – Data Model
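The last bullet, exposing block placement, is visible to clients through the FileSystem API; a minimal sketch, with a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/demo/mydata/foo")); // hypothetical path
    // Ask the namenode which datanodes hold each block of the file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}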

Page 10: Hadoop Fundamentals

HDFS - Architecture

Page 11: Hadoop Fundamentals

• Namenode is a SPOF (HA for the NN is now available in 2.0 alpha)

• Responsible for managing the list of all active datanodes and the FS namesystem (files, directories, blocks and their locations)

• Block placement policy

• Ensuring adequate replicas

• Writing edit logs durably

Namenode

Page 12: Hadoop Fundamentals

• Service to allow data to be streamed in and out

• The block is the unit of data that the datanode understands

• Sends block reports to the Namenode periodically

• Checksum checks and disk usage stats are managed by the datanode

• Clients talk to the datanode for the actual data

• As long as at least one datanode holding a file's blocks is available, datanode failures can be tolerated, albeit at lower performance

Datanode

Page 13: Hadoop Fundamentals

HDFS – Write pipeline

Diagram: The DFS Client asks the Namenode to create the file and get block locations (1); the Namenode replies with a pipeline of datanodes, DN 1, 2 & 3, spread across Rack 1 and Rack 2 (2). The client streams the file to the first datanode, which forwards it down the pipeline to the others; acks flow back up the pipeline to the client, and the client finally tells the Namenode that the file is complete.
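From the client's side the whole pipeline is hidden behind a single output stream; a minimal sketch of a write using the FileSystem Java API (the path is hypothetical, and the datanode pipeline is set up by the framework):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/demo/mydata/newfile.txt");  // hypothetical path
    // create() contacts the namenode; the returned stream writes through
    // the datanode pipeline (e.g. DN1 -> DN2 -> DN3) behind the scenes
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("hello hdfs\n");
    }
    // closing the stream completes the file with the namenode
  }
}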

Page 14: Hadoop Fundamentals

• Default is 3 replicas, but configurable

• Blocks are placed (writes are pipelined):
‣ on the same node
‣ on a different rack
‣ on the other rack

• Clients read from the closest replica

• If the replication for a block drops below target, it is automatically re-replicated

HDFS – Block placement
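Since replication is configurable per file, it can also be changed from client code; a minimal sketch, where the path and the target of 2 replicas are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);      // default for files created with this conf
    FileSystem fs = FileSystem.get(conf);
    // change the target replication of an existing (hypothetical) file
    fs.setReplication(new Path("/user/demo/mydata/foo"), (short) 2);
  }
}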

Page 15: Hadoop Fundamentals

• Data is checked with CRC32

• File creation
‣ Client computes a checksum per block
‣ DataNode stores the checksum

• File access
‣ Client retrieves the data and checksum from the DataNode
‣ If validation fails, the client tries other replicas

HDFS – Data correctness

Page 16: Hadoop Fundamentals

Simple commands
• hadoop fs -ls, -du, -rm, -rmr, -chown, -chmod

Uploading files
• hadoop fs -put foo mydata/foo
• cat ReallyBigFile | hadoop fs -put - mydata/ReallyBigFile

Downloading files
• hadoop fs -get mydata/foo foo
• hadoop fs -cat mydata/ReallyBigFile | grep "the answer is"
• hadoop fs -cat mydata/foo

Admin
• hadoop dfsadmin -report
• hadoop fsck /

Interacting with HDFS

Page 17: Hadoop Fundamentals

Map-Reduce

Page 18: Hadoop Fundamentals

Say we have 100s of machines available to us. How do we write applications on them?

As an example, consider the problem of creating an index for search.
‣ Input: hundreds of documents
‣ Output: a mapping of word to document IDs
‣ Resources: a few machines

Map-Reduce Application

Page 19: Hadoop Fundamentals

The problem : Inverted Index

Farmer1 has the following animals: bees, cows, goats.

Some other animals …

Desired output (word → document IDs):
Animals: 1, 2, 3, 4, 12
Bees: 1, 2, 23, 34
Dog: 3, 9
Farmer1: 1, 7

Page 20: Hadoop Fundamentals

Building an inverted index

Map phase (each machine indexes its share of the documents):
‣ Machine1: Animals: 1, 3; Dog: 3
‣ Machine2: Animals: 2, 12; Bees: 23
‣ Machine3: Dog: 9; Farmer1: 7

Shuffle (entries for the same word are routed to the same machine):
‣ Machine4: Animals: 1, 3; Animals: 2, 12; Bees: 23
‣ Machine5: Dog: 3; Dog: 9; Farmer1: 7

Reduce phase (per-word lists are merged):
‣ Machine4: Animals: 1, 2, 3, 12; Bees: 23
‣ Machine5: Dog: 3, 9; Farmer1: 7

Page 21: Hadoop Fundamentals

In our example:
‣ Map: (doc-num, text) ➝ [(word, doc-num)]
‣ Reduce: (word, [doc1, doc3, ...]) ➝ [(word, "doc1, doc3, …")]

General form:
‣ Two functions: Map and Reduce
‣ Operate on key-value pairs
‣ Map: (K1, V1) ➝ list(K2, V2)
‣ Reduce: (K2, list(V2)) ➝ (K3, V3)
‣ Primitives present in Lisp and other functional languages

Same principle extended to distributed computing:
‣ Map and Reduce tasks run on distributed sets of machines

This is Map-Reduce
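A minimal sketch of these two functions for the inverted-index example, written against the Hadoop Java API; it assumes the input already arrives as (doc-id, text) pairs (e.g. via KeyValueTextInputFormat), and all class and variable names are illustrative:

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  // Map: (doc-id, text) -> [(word, doc-id)]
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    private final Text word = new Text();
    @Override
    protected void map(Text docId, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken().toLowerCase());
        context.write(word, docId);
      }
    }
  }

  // Reduce: (word, [doc1, doc3, ...]) -> (word, "doc1, doc3, ...")
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      Set<String> unique = new LinkedHashSet<>();   // keep each doc-id once, in arrival order
      for (Text docId : docIds) {
        unique.add(docId.toString());
      }
      context.write(word, new Text(String.join(", ", unique)));
    }
  }
}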

Page 22: Hadoop Fundamentals

Abstracts functionality common to all Map/Reduce applications:
‣ Distributes tasks to multiple machines
‣ Sorts, transfers and merges intermediate data from all machines from the Map phase to the Reduce phase
‣ Monitors task progress
‣ Handles faulty machines and faulty tasks transparently

Provides pluggable APIs and configuration mechanisms for writing applications (a driver sketch follows below):
‣ Map and Reduce functions
‣ Input formats and splits
‣ Number of tasks, data types, etc.

Provides status about jobs to users

Map-Reduce Framework
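A minimal sketch of that pluggable configuration side (the "launching program"), reusing the mapper and reducer classes from the previous sketch; the job name, paths and number of reduce tasks are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inverted index");
    job.setJarByClass(InvertedIndexDriver.class);

    // pluggable pieces: map/reduce functions, input format, data types, task counts
    job.setMapperClass(InvertedIndex.IndexMapper.class);
    job.setReducerClass(InvertedIndex.IndexReducer.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(2);                       // number of reduce tasks is configurable

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);   // submit and wait, reporting progress
  }
}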

Page 23: Hadoop Fundamentals

MR – Architecture

Diagram: The Job Client submits a job to the Job Tracker. Task Trackers send periodic heartbeats to the Job Tracker and receive task assignments in response, reporting progress as tasks run; intermediate data is shuffled between Task Trackers, and job input and output live in HDFS (read and written through the DFS Client).

Page 24: Hadoop Fundamentals

• All user code runs in an isolated JVM

• The client computes the input splits

• The JT just schedules these splits (one mapper per split)

• Mapper, Reducer, Partitioner, Combiner and any custom Input/OutputFormat run in the user JVM

• Idempotence

Map-Reduce
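One example of a pluggable component that runs in the user JVM is a custom Partitioner; a minimal, purely illustrative sketch that splits keys into two buckets by their first letter:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
    int bucket = (first <= 'm') ? 0 : 1;   // words a-m vs the rest
    return bucket % numPartitions;         // stay within the number of reduce tasks
  }
}

It would be wired in with job.setPartitionerClass(AlphabetPartitioner.class); by default Hadoop uses a hash partitioner.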

Page 25: Hadoop Fundamentals

Hadoop HDFS + MR cluster

Diagram: A set of machines each running a Datanode (D) and a Tasktracker (T), together with a JobTracker and a Namenode. The Client submits jobs to the JobTracker, gets block locations from the Namenode, and monitors progress through the HTTP monitoring UI.

Page 26: Hadoop Fundamentals

• Input: a bunch of large text files

• Desired output: frequencies of words

WordCount: Hello World of Hadoop

Page 27: Hadoop Fundamentals

Hadoop – Two services in one

Page 28: Hadoop Fundamentals

Mapper
‣ Input: value: a line of the input text
‣ Output: key: word, value: 1

Reducer
‣ Input: key: word, value: set of counts
‣ Output: key: word, value: sum

Launching program
‣ Defines the job
‣ Submits the job to the cluster

Word Count Example
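A minimal sketch of the mapper and reducer just described, against the Hadoop Java API (class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Input: a line of text; Output: (word, 1) for every word in the line
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Input: (word, [1, 1, ...]); Output: (word, sum)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}

The launching program wires these up the same way as the earlier driver sketch; since the reduce here is just a sum, the same reducer class can also be set as a combiner to pre-aggregate counts on the map side.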

Page 29: Hadoop Fundamentals

Questions ?

Page 30: Hadoop Fundamentals

Thank You!

mailto: [email protected]