Hadoop and HDFS in CMRI
China Mobile Research Institute
WANG, Xu [wangxu(at)chinamobile.com]
Apache Hadoop
http://hadoop.apache.org/
Open source clone of Google infrastructure
De facto standard MapReduce framework; has won the TeraSort benchmark several times
Used for search engines, data mining, and log analysis
Clusters scale up to 4,000 nodes
Users: Yahoo!, Facebook, Cloudera, Baidu, Alibaba, China Mobile
Internal material, keep confidential
Hadoop in China 2009
Beijing, Nov 15, 2009
Subprojects of Hadoop
[Diagram: Hadoop subproject stack, top to bottom]
Pig
Hive: data warehouse
HBase (Google BigTable): K-K-V store / column-based DB
ZooKeeper (Google Chubby): distributed lock
Avro (ipc): serialized data format & RPC
Hadoop (basic platform)
  Hadoop Core: HDFS (Google GFS), MapReduce (Google MapReduce)
  Hadoop Common (io, ipc….)
JVM
HDFS Principles
Follows the Google GFS paper
For big data storage and processing
Write once, read frequently
  Modification is not permitted; append will be supported soon
  Reads take priority over writes
Runs on commodity PCs
  Hardware may fail at any time
  Multiple replicas for data safety
HDFS Architecture
Data in HDFS NameNode’s Memory
Namespace info
  FS hierarchical tree
  Map(file, blocks)
DataNode map
  Map(living datanode, blocks)
Blocks map
  Map(block, file/datanodes)
Other runtime info
  Locks held by clients
  Blocks being processed (replication, invalid…)
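The three maps above can be pictured with a small Python sketch. This is only an illustration of the data relationships, not Hadoop's actual code: the class name `NameNodeMemory` and all of its methods are hypothetical, and the real NameNode keeps far more state per entry.

```python
from collections import defaultdict


class NameNodeMemory:
    """Toy model of the maps kept in NameNode memory (all names hypothetical)."""

    def __init__(self):
        self.file_blocks = {}                    # Map(file, blocks)
        self.datanode_blocks = defaultdict(set)  # Map(living datanode, blocks)
        self.block_info = {}                     # Map(block, file/datanodes)

    def add_file(self, path, block_ids):
        """Register a file and the ordered list of blocks that make it up."""
        self.file_blocks[path] = list(block_ids)
        for block_id in block_ids:
            self.block_info[block_id] = {"file": path, "datanodes": set()}

    def block_report(self, datanode, block_ids):
        """Record which known blocks a living datanode holds."""
        for block_id in block_ids:
            if block_id in self.block_info:
                self.datanode_blocks[datanode].add(block_id)
                self.block_info[block_id]["datanodes"].add(datanode)

    def remove_datanode(self, datanode):
        """Forget a dead datanode and return the blocks that lost a replica."""
        lost = self.datanode_blocks.pop(datanode, set())
        for block_id in lost:
            self.block_info[block_id]["datanodes"].discard(datanode)
        return lost

    def locate(self, path):
        """Return (block, sorted datanodes) pairs for a file, in block order."""
        return [(b, sorted(self.block_info[b]["datanodes"]))
                for b in self.file_blocks[path]]
```

In this sketch only `file_blocks` describes the namespace; the other two maps are rebuilt from datanode reports, which matches the point that the DataNode map lives only in memory.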
Persistence of NameNode data
NameNode persistence
  Namespace: FSImage & EditLog
  Starting & shutdown
Secondary NameNode
  Checkpoint (merge EditLog into FSImage)
  Works periodically (every 1 hour by default)
Backup NameNode
  Introduced in 0.21 (not released yet)
  "Real-time Secondary NameNode" or remote EditLog
The DataNode map and other info exist only in NameNode memory
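The checkpoint idea (merge EditLog into FSImage, then start from an empty log) can be sketched as follows. This is a simplified model under stated assumptions: the image is a plain dict from path to block list, the log is a list of tuples, and the operation names `create`, `add_block`, and `delete` are invented for illustration. The real FSImage and EditLog are on-disk formats with many more operation types.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict (path -> list of blocks)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError("unknown op: %r" % (op,))


def checkpoint(fsimage, editlog):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    new_image = copy.deepcopy(fsimage)
    for edit in editlog:
        apply_edit(new_image, edit)
    return new_image, []


def recover(fsimage, editlog):
    """Rebuild the namespace at startup: load the image, then replay the log."""
    namespace, _ = checkpoint(fsimage, editlog)
    return namespace
```

Edits logged after the last checkpoint exist only in the EditLog, so in this model losing the log loses exactly those edits.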
High Availability Considerations
Availability in mainstream Hadoop
  SPOF in the NameNode; failure of the NameNode may cause
    Service interruption for minutes
    Data loss for a checkpoint period (worst case)
Possible solution: DRBD + Linux-HA
  Mature failover mechanism
  Service interruption for minutes
  Almost no data loss
Another solution: NameNode Cluster Extension
  Continuous service
  Almost no data loss
  Requires modifying the code
  Consistency vs. performance
HDFS+NNC Architecture
NNC Design
Master & slave: 1:N
Master synchronizes the FSNamesystem to slaves
ZooKeeper works as a registry; clients and datanodes can look up the namenode list from it
DFSClient can access multiple namenodes for read operations
Failover is so far controlled by Linux-HA, which gets namenode status info from ClientProtocol
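The registry lookup and multi-namenode reads can be sketched in Python as below. Everything here is hypothetical: `Registry` is an in-memory stand-in for ZooKeeper, `Client` is not the real DFSClient, the round-robin read policy is just one possible choice, and routing writes to the master only is an assumption drawn from the master/slave design rather than something the slides state.

```python
class Registry:
    """In-memory stand-in for the ZooKeeper registry of namenodes."""

    def __init__(self):
        self._namenodes = []  # convention in this sketch: master first

    def register(self, address, is_master=False):
        if is_master:
            self._namenodes.insert(0, address)
        else:
            self._namenodes.append(address)

    def unregister(self, address):
        self._namenodes.remove(address)

    def namenodes(self):
        """Return a copy of the current namenode list."""
        return list(self._namenodes)


class Client:
    """Hypothetical client: reads go to any namenode, writes to the master."""

    def __init__(self, registry):
        self.registry = registry
        self._next = 0

    def namenode_for_read(self):
        nodes = self.registry.namenodes()
        if not nodes:
            raise RuntimeError("no namenode registered")
        choice = nodes[self._next % len(nodes)]  # simple round-robin
        self._next += 1
        return choice

    def namenode_for_write(self):
        nodes = self.registry.namenodes()
        if not nodes:
            raise RuntimeError("no namenode registered")
        return nodes[0]
```

Because the client re-reads the list on every call, a namenode that is unregistered after a failover simply stops being chosen.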
Update Events
NNU_NOP // nothing to do
NNU_BLK // add or remove a block
NNU_INODE // add or remove or modify an inode (add or remove file; new block allocation)
NNU_NEWFILE // start new file
NNU_CLSFILE // close new file
NNU_MVRM // move or remove file
NNU_MKDIR // mkdir
NNU_LEASE // add/update or release a lease
NNU_LEASE_BATCH // update batch of leases
NNU_DNODEHB_BATCH // batch of datanode heartbeat
NNU_DNODEREG // dnode register
NNU_DNODEBLK // block report
NNU_DNODERM // remove dnode
NNU_BLKRECV // block received message from datanode
NNU_REPLICAMON // replication monitor work
NNU_WORLD // bootstrap a slave node
NNU_MASSIVE // bootstrap a slave node
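How a master might forward such update events to its slaves can be sketched as follows. Only four of the event names above are modeled, their string values are arbitrary, and the classes `SlaveNameNode` and `MasterNameNode` are hypothetical; the actual NNC code and wire format are not shown in the slides.

```python
from enum import Enum


class UpdateEvent(Enum):
    # Names come from the event list; the values are arbitrary placeholders.
    NNU_NOP = "nop"
    NNU_MKDIR = "mkdir"
    NNU_NEWFILE = "newfile"
    NNU_CLSFILE = "clsfile"


class SlaveNameNode:
    """Hypothetical slave that applies events to a tiny in-memory namespace."""

    def __init__(self):
        self.dirs = set()
        self.open_files = set()
        self.closed_files = set()

    def apply(self, event, path=None):
        if event is UpdateEvent.NNU_NOP:
            return
        if event is UpdateEvent.NNU_MKDIR:
            self.dirs.add(path)
        elif event is UpdateEvent.NNU_NEWFILE:
            self.open_files.add(path)
        elif event is UpdateEvent.NNU_CLSFILE:
            self.open_files.discard(path)
            self.closed_files.add(path)
        else:
            raise ValueError("unhandled event: %r" % (event,))


class MasterNameNode(SlaveNameNode):
    """Hypothetical master: applies each event locally, then sends it to slaves."""

    def __init__(self, slaves):
        super().__init__()
        self.slaves = list(slaves)

    def update(self, event, path=None):
        self.apply(event, path)
        for slave in self.slaves:  # direct master-to-slave delivery
            slave.apply(event, path)
```

Delivery here is synchronous, so every metadata write pays for one extra call per slave before it returns.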
Performance and Other Issues
Overhead of NameNode synchronization
  For typical file I/O and MapReduce (sort, wordcount)
    The NNC system reaches 95% of the performance of Hadoop without NNC
  For metadata write-only operations (parallel touchz or mkdir)
    The NNC system reaches 15% of the performance of Hadoop without NNC
Performance gain from multiple NameNodes in read-only operations
  Not observed so far, unfortunately
Other design issues
  Why synchronize from master to slaves directly, without an additional delivery node?
    That may introduce another SPOF and make the problem more complex.
  Why not use ZooKeeper for failover?
    Linux-HA works well, and we are also evaluating whether to change to ZK. Any suggestions?
Q & A