Apache Hadoop India Summit 2011
HDFS Federation
Sanjay Radia, Hadoop Architect
Yahoo! Inc
Outline
• HDFS - Quick overview
• Scaling HDFS - Federation

Hadoop Components
HDFS       Distributed file system
MapReduce  Distributed computation
HBase      Column store
Pig        Dataflow language
Hive       Data warehouse
Zookeeper  Distributed coordination
Avro       Data serialization
Oozie      Workflow
HDFS
[Diagram: a Namenode and a set of Datanodes, with block replicas spread across the Datanodes]
• The Namenode holds the namespace metadata & journal: the namespace state and the block map (Block ID -> Block Locations)
• Hierarchical namespace: File Name -> Block IDs
• Datanodes store the block data (Block ID -> Data) and send heartbeats & block reports to the Namenode
• Backup Namenode
• Horizontally scale IO and storage by adding Datanodes
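The two Namenode structures described above can be sketched as a toy data model. This is illustrative only (the names are not the real Hadoop classes): the hierarchical namespace maps a file name to its block IDs, while the block map is rebuilt from Datanode block reports rather than persisted.

```python
class ToyNamenode:
    """Toy sketch of the Namenode's in-memory state (illustrative names)."""

    def __init__(self):
        self.namespace = {}   # file name -> list of block IDs
        self.block_map = {}   # block ID  -> set of datanode IDs with a replica

    def create_file(self, name, block_ids):
        self.namespace[name] = list(block_ids)
        for b in block_ids:
            self.block_map.setdefault(b, set())

    def block_report(self, datanode_id, block_ids):
        # Datanodes periodically report the blocks they hold; the Namenode
        # derives replica locations from these reports.
        for b in block_ids:
            self.block_map.setdefault(b, set()).add(datanode_id)

    def locate(self, name):
        # Resolve a file: name -> block IDs -> replica locations.
        return [(b, sorted(self.block_map[b])) for b in self.namespace[name]]
```

Note how only the namespace (names and block IDs) needs to be journaled; the location half of the state is soft and is reconstructed from block reports after a restart.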
5
HDFSClient reads and writes
b1
b2
b3 b1
b5
b3 b3
b5
b2
b4b5
b6b2
b3
b4
Client Client
Namenode1 open
2 read2 write
1 create
writewrite
Datanodes
Namespace
State
Block
Map
End-to-end checksum
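The read path above can be sketched in a few lines. This is a toy model, not the Hadoop client: the Namenode supplies only block locations, the bytes flow straight from a Datanode, and a per-block checksum recorded at write time is verified by the reader, which is what makes the checksum end-to-end.

```python
import hashlib

def read_file(block_locations, datanodes):
    """block_locations: ordered (block_id, [datanode_ids]) pairs, as returned
    by the Namenode on open(); datanodes: datanode_id -> {block_id: (payload,
    checksum)}. All names here are illustrative."""
    data = b""
    for block_id, locations in block_locations:
        payload, checksum = datanodes[locations[0]][block_id]  # any replica
        if hashlib.md5(payload).hexdigest() != checksum:       # end-to-end check
            raise IOError("corrupt replica of " + block_id)    # try next replica in real HDFS
        data += payload
    return data
```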
HDFS Architecture: Computation close to the data
[Diagram: a Hadoop cluster in which MAP tasks run on the nodes holding Block 1, Block 2, and Block 3 of the input data, and a Reduce task gathers the results]
Quiz: What Is the Common Attribute?
HDFS: Actively maintain data reliability
[Diagram: the Namenode directing Datanodes to repair a bad/lost block replica]
• Datanodes periodically check block checksums
• On a bad/lost block replica: 1. the Namenode asks a Datanode to replicate the block, 2. that Datanode copies it to another Datanode, 3. the receiving Datanode reports blockReceived to the Namenode
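The repair loop above can be sketched as a toy planner (illustrative, not Hadoop code): the Namenode compares each block's live replica count against the target replication factor, issues copy orders from a surviving holder, and applies blockReceived confirmations to its block map.

```python
def plan_replication(block_map, live_datanodes, target=3):
    """block_map: block_id -> set of datanode_ids; live_datanodes: set of
    datanodes currently heartbeating. Returns (block, source, destination)
    copy orders for under-replicated blocks."""
    orders = []
    for block, holders in sorted(block_map.items()):
        live = sorted(h for h in holders if h in live_datanodes)
        if live and len(live) < target:
            # 1. replicate: pick a live source; 2. copy: pick nodes without it
            candidates = sorted(live_datanodes - set(live))
            for dest in candidates[: target - len(live)]:
                orders.append((block, live[0], dest))
    return orders

def block_received(block_map, block, datanode):
    # 3. blockReceived: the destination confirms the new replica.
    block_map[block].add(datanode)
```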
Hadoop at Yahoo!
[Chart: availability SLA of roughly 99.85%, 99.47%, and 99.69% across the Production, Research, and Sandbox clusters]
[Chart: node growth by quarter, 2006 Q1 through 2010 Q3]
[Chart: nodes running Hadoop at Yahoo! - 13,687 / 22,334 / 7,803 across the Production, Research, and Sandbox clusters]
• Total Nodes = 43,936
• Total Storage = 206 PB
• Over 43,000 nodes running Hadoop
• 1M+ Monthly Hadoop Jobs
Scaling Hadoop
Early Gains
• Simple design allowed rapid improvements
• Namespace is all in RAM, simpler locking
• Improved memory usage in 0.16, JVM heap configuration (Suresh Srinivas)

Growth in the number of files and storage is limited by adding RAM to the Namenode:
• 50 GB heap = 200M "fs objects" = 100M names + 100M blocks
• 14 PB of storage (50 MB block size)
• 4K nodes
• The Job Tracker carries out both job lifecycle management and scheduling

Yahoo's response:
• HDFS Federation: horizontal scaling of the namespace (0.22)
• Next Generation MapReduce - complete overhaul of the Job Tracker / Task Tracker

Goal:
• Clusters of 6,000 nodes, 100,000 cores & 10K concurrent jobs, 100 PB raw storage per cluster

6 May 2010
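The sizing bullets above can be checked back-of-the-envelope. The replication factor is an assumption here (the slide does not state it; 3x is the HDFS default), and the per-object cost is simply heap divided by object count:

```python
# Rough check of the slide's Namenode sizing figures.
heap_bytes     = 50 * 2**30           # 50 GB Namenode heap
fs_objects     = 200_000_000          # "fs objects" = 100M names + 100M blocks
bytes_per_obj  = heap_bytes / fs_objects   # ~268 bytes of heap per object

blocks         = 100_000_000
block_size     = 50 * 10**6           # 50 MB average block size, per the slide
replication    = 3                    # assumed HDFS default, not on the slide
raw_storage_pb = blocks * block_size * replication / 10**15
```

This gives roughly 15 PB of raw storage, consistent with the slide's 14 PB figure once rounding of the average block size is allowed for.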
Scaling the Name Service: Options
[Chart (not to scale): the options below plotted by # names served (100M, 200M, 1B, 2B, 10B, 20B) against # clients supported (1x, 4x, 20x, 50x, 100x)]
• All NS in memory
• Archives
• Partial NS (cache) in memory
• Multiple namespace volumes - good isolation properties
• Partial NS in memory with namespace volumes
• Separate Bmaps from the NN - block reports for billions of blocks require rethinking the block layer
• Distributed NNs
Opportunity: Vertical & Horizontal Scaling
Horizontal scaling/federation benefits:
– Scale
– Isolation, stability, availability
– Flexibility
– Other Namenode implementations or non-HDFS namespaces
Namenode Horizontal: Federation
Vertical scaling:
• More RAM, efficiency in memory usage
• First-class archives (tar/zip-like)
• Partial namespace in main memory
Block (Object) Storage Subsystem
[Diagram: a namespace layer (NS1 ... NS k ... Foreign NS n) over a block storage layer of block pools (Pool 1 ... Pool k ... Pool n) spread across Datanodes 1..m, with a Balancer]
• Shared storage provided as pools of blocks
• Namespaces (HDFS, others) use one or more block pools
• Note: HDFS has 2 layers today - we are generalizing/extending it
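The storage side of this layering can be sketched as follows (a toy model with illustrative names, not the Datanode implementation): each Datanode keys its replicas by (block pool ID, block ID), so independent namespaces share the same physical disks while their block IDs never collide, and block reports can be produced per pool for the namespace that owns it.

```python
class ToyDatanode:
    """Toy sketch of a federated Datanode's storage (illustrative names)."""

    def __init__(self):
        self.storage = {}                     # (pool_id, block_id) -> bytes

    def write_block(self, pool_id, block_id, data):
        self.storage[(pool_id, block_id)] = data

    def blocks_in_pool(self, pool_id):
        # Per-pool view: a block report for exactly one namespace's pool.
        return sorted(b for (p, b) in self.storage if p == pool_id)
```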
1st Phase: Block-Pool Management Inside the Namenode
[Diagram: Namenodes NN-1 ... NN-k ... NN-n, each serving one namespace (NS1 ... NS k ... Foreign NS n) and managing its own block pool, over a shared set of Datanodes 1..m with a Balancer. Annotation: future - move block management into separate nodes]
Future: Move Block Management Out
[Diagram: namespaces (NS1 ... NS k ... Foreign NS n) over a separate Block Manager layer and block pools (Pool 1 ... Pool k ... Pool n) on Datanodes 1..m, with a Balancer]
• Client flow: 1. Open (name server), 2. getBlockLocations (Block Manager), 3. ReadBlock (Datanode)
• The block manager is easier to scale horizontally than the name server
What Is an HDFS Cluster?

Current HDFS cluster:
– 1 namespace and a set of blocks
– Implemented as 1 Namenode and a set of Datanodes

New HDFS cluster:
– N namespaces and a set of block pools; each block pool is a set of blocks
– Phase 1: 1 block pool per namespace, implying N block pools
– Implemented as N Namenodes and a set of Datanodes; each Datanode stores blocks for each block pool
Managing Namespaces
• Federation has multiple namespaces - don't you need a single global namespace?
– The key is to share the data and the names used to access the shared data
• A global namespace is one way to do that - but even there we talk of several large "global" namespaces
• A client-side mount table is another way to share
– Shared mount table => "global" shared view
– Personalized mount table => per-application view
• Share the data that matters by mounting it
[Diagram: a client-side mount table mapping /home, /project, /data, and /tmp]
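Client-side mount-table resolution can be sketched as longest-prefix matching over mount points (a toy model; in Hadoop this idea became the viewfs client, but the function and table below are illustrative): each application composes its own view by mapping path prefixes onto federated namespaces.

```python
def resolve(mount_table, path):
    """mount_table: mount point -> (namespace, target path prefix).
    Resolves a path against the longest matching mount point."""
    best = max((m for m in mount_table
                if path == m or path.startswith(m + "/")),
               key=len, default=None)
    if best is None:
        raise KeyError("no mount point for " + path)
    ns, target = mount_table[best]
    return ns, target + path[len(best):]
```

A shared table gives every client the same "global" view; swapping in a personalized table gives a per-application view without touching the namespaces themselves.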
HDFS Federation Across Clusters
[Diagram: Cluster 1 and Cluster 2, each with /home, /project, /data, and /tmp under /; an application mount table in Cluster 1 and another in Cluster 2 compose views over the namespaces of both clusters]
Nameserver as a Container for Namespaces
• Each namespace has its own separate state
– Persistent state in shared storage (e.g. BookKeeper)
• Each nameserver serves a set of namespaces
– Selected based on isolation and capacity
• A namespace can be moved between nameservers
[Diagram: multiple nameservers over shared persistent storage for namespace metadata (e.g. BookKeeper)]
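The container idea above can be sketched as a toy model (illustrative names throughout): because a namespace's persistent metadata lives in shared storage, moving it between nameservers is just a change of ownership, with no metadata copy.

```python
class ToyNameserver:
    """Toy sketch of a nameserver hosting a set of namespaces."""

    def __init__(self, shared_store):
        self.shared_store = shared_store   # namespace -> persistent metadata
        self.hosted = set()                # namespaces this server answers for

    def mount(self, namespace):
        assert namespace in self.shared_store
        self.hosted.add(namespace)

    def unmount(self, namespace):
        self.hosted.discard(namespace)

def move_namespace(namespace, src, dst):
    # Only ownership moves (e.g. for isolation or capacity); the metadata
    # stays where it is, in the shared store.
    src.unmount(namespace)
    dst.mount(namespace)
```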
Summary
Federated HDFS (JIRA HDFS-1052)
• Scale by adding independent Namenodes
• Preserves the robustness of the Namenode
• Not much code change to the Namenode
• Generalizes the block storage layer
– Analogous to SANs & LUNs
• Can add other implementations of the Namenode
– Even other name services (HBase?)
• Could move block management out of the Namenode in the future
– But to truly scale to tens or hundreds of billions of blocks we need to rethink the block map and block reports
• Benefits
– Scale the number of file names and blocks
– Improved isolation and hence availability

Q & A