Hadoop Fundamentals
Satish Mittal, InMobi

This deck covers the basic concepts behind the Hadoop Distributed File System (HDFS) and the Hadoop Map-Reduce framework.
Why Hadoop?
Big Data
• Sources: server logs, clickstream, machine, sensor, social…
• Use-cases: batch / interactive / real-time
What is needed from a Distributed Platform?
• Scalable
  o Petabytes of data
• Economical
  o Use commodity hardware
  o Share clusters among many applications
• Reliable
  o Failure is common when you run thousands of machines; handle it well in the SW layer
• Simple programming model
  o Applications must be simple to write and maintain
What is Hadoop?
• Petabyte-scale distributed data storage and data processing infrastructure
• Based on the Google GFS & MapReduce papers
• Contributed mostly by Yahoo! in the initial years; now has a much more widespread developer and user base
• 1000s of nodes, PBs of data in storage
Hadoop Basics
• Cheap JBODs for storage
• Move processing to where the data is: location awareness (topology)
• Assume hardware failures to be the norm
• Map & Reduce primitives are fairly simple yet powerful; most set operations can be performed using these primitives
• Isolation
Hadoop Distributed File System (HDFS)
Goals:
• Fault tolerant, scalable, distributed storage system
• Designed to reliably store very large files across machines in a large cluster
Assumptions:
• Files are written once and read several times
• Applications perform large sequential streaming reads
• Not a Unix-like, POSIX file system
Access is via the command line or the Java API (a minimal read sketch follows).
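To make the Java API route concrete, here is a minimal sketch of reading a file from HDFS. The path is hypothetical, and the configuration is assumed to be picked up from core-site.xml on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Reads the cluster settings (filesystem URI etc.) from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; replace with a real HDFS path
        Path path = new Path("/user/demo/mydata/foo");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}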
HDFS – Data Model
• Data is organized into files and directories
• Files are divided into uniform-sized blocks and distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• The filesystem keeps checksums of data for corruption detection and recovery
• HDFS exposes block placement so that computation can be migrated to the data
HDFS – Architecture

Namenode
• The Namenode is a SPOF (HA for the Namenode is now available in the 2.0 alpha)
• Responsible for managing the list of all active datanodes and the filesystem namespace (files, directories, blocks and their locations)
• Block placement policy
• Ensuring adequate replicas
• Writing edit logs durably
Datanode
• Service that allows data to be streamed in and out
• The block is the unit of data that a datanode understands
• Sends block reports to the Namenode periodically
• Checksum verification and disk usage stats are managed by the datanode
• Clients talk to datanodes for the actual data
• As long as at least one datanode holding a file's blocks is available, datanode failures can be tolerated, albeit at lower performance
HDFS – Write pipeline
[Diagram: the DFS client asks the Namenode to create the file and get block locations (1) and receives datanodes DN 1, 2 & 3 (2); the client streams the file to Data node 1, which pipelines it to Data node 2 and Data node 3 (spanning Rack 1 and Rack 2); acks flow back along the pipeline (5a, 4a, 3a), and the client finally marks the file complete at the Namenode (3b).]
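From the client's side this whole pipeline is hidden behind an output stream. A minimal sketch using the Java API (the destination path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical destination path
        Path path = new Path("/user/demo/mydata/out.txt");

        // create() asks the Namenode for block allocations; the replication
        // pipeline across datanodes is handled transparently by the client library
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("hello, hdfs\n");
        }
    }
}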
HDFS – Block placement
• Default is 3 replicas, but configurable (see the sketch below)
• Blocks are placed (and writes pipelined): first on the writer's node, then on a node in a different rack, then on another node in that other rack
• Clients read from the closest replica
• If the replication for a block drops below the target, it is automatically re-replicated
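Since the replica count is per-file, it can also be changed after the fact. A small sketch using FileSystem.setReplication (the path is hypothetical; the call returns false if the file does not exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Raise the replication target of one file from the default (3) to 5;
        // the Namenode then schedules the extra replicas in the background
        boolean ok = fs.setReplication(new Path("/user/demo/mydata/foo"), (short) 5);
        System.out.println("Replication change accepted: " + ok);
    }
}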
HDFS – Data correctness
• Data is checked with CRC32
• File creation:
  ‣ Client computes a checksum per block
  ‣ DataNode stores the checksum
• File access:
  ‣ Client retrieves the data and checksum from the DataNode
  ‣ If validation fails, the client tries other replicas
Interacting with HDFS
Simple commands
• hadoop fs -ls, -du, -rm, -rmr, -chown, -chmod
Uploading files
• hadoop fs -put foo mydata/foo
• cat ReallyBigFile | hadoop fs -put - mydata/ReallyBigFile
Downloading files
• hadoop fs -get mydata/foo foo
• hadoop fs -cat mydata/ReallyBigFile | grep "the answer is"
• hadoop fs -cat mydata/foo
Admin
• hadoop dfsadmin -report
• hadoop fsck /
Map-Reduce
Map-Reduce Application
Say we have 100s of machines available to us. How do we write applications that run on them?
As an example, consider the problem of creating an index for search:
‣ Input: hundreds of documents
‣ Output: a mapping of word to document IDs
‣ Resources: a few machines
The problem: Inverted Index
Input: documents, e.g. "Farmer1 has the following animals: bees, cows, goats." plus some other documents about animals…
Desired output, a word ➝ doc-ID index:
Animals: 1, 2, 3, 4, 12
Bees: 1, 2, 23, 34
Dog: 3, 9
Farmer1: 1, 7
…
Building an inverted index
[Diagram: Machines 1–3 each scan their local documents and emit partial postings (Machine1: Animals: 1, 3 and Dog: 3; Machine2: Animals: 2, 12 and Bees: 23; Machine3: Dog: 9 and Farmer1: 7). The postings are then routed by word to Machines 4 and 5, which merge them into the final lists (Machine4: Animals: 1, 2, 3, 12 and Bees: 23; Machine5: Dog: 3, 9 and Farmer1: 7).]
This is Map-Reduce
In our example:
‣ Map: (doc-num, text) ➝ [(word, doc-num)]
‣ Reduce: (word, [doc1, doc3, ...]) ➝ [(word, "doc1, doc3, …")]
General form:
‣ Two functions: Map and Reduce
‣ Operate on key and value pairs
‣ Map: (K1, V1) ➝ list(K2, V2)
‣ Reduce: (K2, list(V2)) ➝ (K3, V3)
‣ Primitives present in Lisp and other functional languages
Same principle extended to distributed computing:
‣ Map and Reduce tasks run on distributed sets of machines (see the Java sketch below)
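A sketch of those two functions for the inverted-index example, written against Hadoop's org.apache.hadoop.mapreduce API. The types are assumptions for illustration; in particular, it assumes an input format that presents the document ID as the record key:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Map: (doc-num, text) -> [(word, doc-num)]
    public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text docId, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word.toLowerCase()), docId);
                }
            }
        }
    }

    // Reduce: (word, [doc1, doc3, ...]) -> (word, "doc1, doc3, ...")
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text docId : docIds) {
                if (postings.length() > 0) {
                    postings.append(", ");
                }
                postings.append(docId.toString());
            }
            context.write(word, new Text(postings.toString()));
        }
    }
}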
Map-Reduce Framework
Abstracts functionality common to all Map-Reduce applications:
‣ Distributes tasks to multiple machines
‣ Sorts, transfers and merges intermediate data from all machines between the Map phase and the Reduce phase
‣ Monitors task progress
‣ Handles faulty machines and faulty tasks transparently
Provides pluggable APIs and configuration mechanisms for writing applications:
‣ Map and Reduce functions
‣ Input formats and splits
‣ Number of tasks, data types, etc.
Provides status about jobs to users
MR – Architecture
[Diagram: the Job Client submits the job to the JobTracker; each TaskTracker sends periodic heartbeats to the JobTracker and receives task assignments in return; intermediate map output is shuffled between TaskTrackers; progress is reported back to the Job Client; job input and output live in HDFS, accessed through DFS clients.]
Map-Reduce
• All user code runs in isolated JVMs
• The client computes the input splits
• The JobTracker just schedules these splits (one mapper per split)
• The Mapper, Reducer, Partitioner, Combiner and any custom Input/OutputFormat run in the user JVM
• Idempotence
Hadoop – Two services in one
Hadoop HDFS + MR cluster: machines run both Datanodes and TaskTrackers.
[Diagram: worker machines each host a Datanode (D) and a TaskTracker (T); the Namenode and JobTracker run as the two masters. A client submits jobs to the JobTracker, gets block locations from the Namenode, and monitors progress through an HTTP monitoring UI.]

WordCount: Hello World of Hadoop
• Input: a bunch of large text files
• Desired output: frequencies of words
Word Count Example
Mapper
‣ Input: value: lines of input text
‣ Output: key: word, value: 1
Reducer
‣ Input: key: word, value: set of counts
‣ Output: key: word, value: sum
Launching program
‣ Defines the job
‣ Submits the job to the cluster
A complete sketch follows.
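A minimal, self-contained sketch of this example against the org.apache.hadoop.mapreduce API (class names and paths are illustrative; args[0] and args[1] are the input and output directories):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for each line of text, emit (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Launching program: defines the job and submits it to the cluster
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}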
Questions?
Thank You!
mailto: [email protected]