
Hadoop and HDFS in CMRI

China Mobile Research Institute
WANG, Xu [wangxu(at)chinamobile.com]

Apache Hadoop

http://hadoop.apache.org/
Open source clone of Google infrastructure
De facto standard MapReduce framework; has won TeraSort several times
Search Engine, Data Mining, Log Analyzing
Clusters scale up to 4,000 nodes
Yahoo!, Facebook, Cloudera
Baidu, Alibaba, China Mobile

Internal material: keep confidential

Hadoop in China 2009


Beijing, Nov 15, 2009

Subprojects of Hadoop

[Diagram: Hadoop subproject stack]

Pig
Hive: Data Warehouse
HBase (BigTable): K-K-V Store / Column-based DB
ZooKeeper (Chubby): Distributed Lock
Avro (ipc): Serialized Data Format & RPC
Hadoop: Basic Platform
  Hadoop Core
    MapReduce (Google MapReduce)
    HDFS (Google GFS)
    Hadoop Common (io, ipc, …)
JVM

HDFS Principles

Follows the Google GFS paper
For big data storage and processing
Write once, read frequently
  Modification is not permitted; append will be supported soon
  Reads take priority over writes
Runs on commodity PCs
  Hardware may fail at any time
  Multiple replicas for data safety
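
The "multiple replicas" principle can be illustrated with a small sketch. This is not HDFS code: the function names `place_replicas` and `handle_failure`, the node names, and the random placement policy are all assumptions made for illustration (real HDFS placement is rack-aware). The only value taken from HDFS itself is the default replication factor of 3.

```python
import random

REPLICATION = 3  # HDFS's default replication factor


def place_replicas(live_nodes, replication=REPLICATION, rng=random):
    """Pick distinct datanodes to hold the copies of one block (random policy)."""
    if len(live_nodes) < replication:
        raise ValueError("not enough live datanodes")
    return set(rng.sample(sorted(live_nodes), replication))


def handle_failure(block_locations, failed_node, live_nodes,
                   replication=REPLICATION, rng=random):
    """Forget a failed node and re-replicate every block that lost a copy."""
    live = set(live_nodes) - {failed_node}
    for holders in block_locations.values():
        holders.discard(failed_node)
        missing = replication - len(holders)
        candidates = sorted(live - holders)
        if missing > 0:
            holders.update(rng.sample(candidates, min(missing, len(candidates))))
    return block_locations


nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}
locations = {b: place_replicas(nodes) for b in ("blk_1", "blk_2", "blk_3")}
handle_failure(locations, "dn2", nodes)  # dn2 dies; lost copies are restored
```

After `handle_failure` returns, every block is again held by three distinct nodes and none of them is the failed one.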

HDFS Architecture


Data in HDFS NameNode’s Memory

Namespace info
  FS hierarchical tree
  Map(file, blocks)
DataNode map
  Map(living datanode, blocks)
Blocks map
  Map(block, file/datanodes)
Other runtime info
  Locks held by clients
  Blocks being processed (replication, invalid…)
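
The three maps above can be sketched as plain dictionaries. The class `NameNodeMemory` and its methods are hypothetical names chosen for this illustration, not the actual HDFS classes; the sketch only shows how a block report from a datanode keeps the DataNode map and the blocks map consistent with each other.

```python
from collections import defaultdict


class NameNodeMemory:
    """Toy model of the in-memory maps (hypothetical, not the HDFS classes)."""

    def __init__(self):
        self.file_blocks = {}                    # Map(file, blocks)
        self.datanode_blocks = defaultdict(set)  # Map(living datanode, blocks)
        self.block_info = {}                     # Map(block, (file, datanodes))

    def add_block(self, path, block_id):
        """Record a new block as the next block of a file."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_info[block_id] = (path, set())

    def block_report(self, datanode, block_ids):
        """Rebuild one datanode's entry from the blocks it reports."""
        for old in self.datanode_blocks[datanode] - set(block_ids):
            self.block_info[old][1].discard(datanode)
        known = {b for b in block_ids if b in self.block_info}
        self.datanode_blocks[datanode] = known
        for b in known:
            self.block_info[b][1].add(datanode)

    def locate(self, path):
        """Return [(block, sorted datanodes)] for a file, as a reader needs."""
        return [(b, sorted(self.block_info[b][1])) for b in self.file_blocks[path]]


nn = NameNodeMemory()
nn.add_block("/logs/a", "blk_1")
nn.add_block("/logs/a", "blk_2")
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_2"])
print(nn.locate("/logs/a"))  # [('blk_1', ['dn1']), ('blk_2', ['dn1', 'dn2'])]
```

Because the datanode-related maps are rebuilt entirely from reports like these, they do not need to be written to disk.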

Persistence of NameNode data

NameNode persistence
  Namespace: FSImage & EditLog
  Starting & shutdown
Secondary NameNode
  Checkpoint (merge EditLog into FSImage)
  Runs periodically (every 1 hour by default)
Backup NameNode
  Introduced in 0.21 (not released yet)
  "Real-time Secondary NameNode" or remote EditLog
The DataNode map and other info exist only in NameNode memory
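
The checkpoint step (merging the EditLog into the FSImage) can be sketched as replaying logged operations onto a snapshot. The operation names (`mkdir`, `create`, `delete`) and the dictionary representation are assumptions for illustration only; the real FSImage and EditLog are binary on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict (path -> entry type)."""
    op, path = edit
    if op == "mkdir":
        namespace[path] = "dir"
    elif op == "create":
        namespace[path] = "file"
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown op: {op}")


def checkpoint(fsimage, editlog):
    """Merge the edit log into a new image; return (new_image, empty_log)."""
    merged = dict(fsimage)  # leave the old image untouched until the merge is done
    for edit in editlog:
        apply_edit(merged, edit)
    return merged, []


fsimage = {"/": "dir"}
editlog = [("mkdir", "/data"), ("create", "/data/a"),
           ("create", "/tmp"), ("delete", "/tmp")]
fsimage, editlog = checkpoint(fsimage, editlog)
print(fsimage)  # {'/': 'dir', '/data': 'dir', '/data/a': 'file'}
```

The same replay happens when the NameNode starts: it loads the last image and applies whatever is in the edit log, so edits that were logged but never reached durable storage are the ones at risk when the NameNode fails.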

High Availability Considerations

Availability in mainstream Hadoop
  SPOF in the NameNode; failure of the NameNode may cause
    Service interruption for minutes
    Data loss for a checkpoint period (worst case)
Possible solution: DRBD + Linux-HA
  Mature failover mechanism
  Service interruption for minutes
  Almost no data loss
Another solution: NameNode Cluster extension
  Continuous service
  Almost no data loss
  Requires modifying the code
  Consistency vs. performance

HDFS+NNC Architecture


NNC Design

Master & slaves: 1:N
The master synchronizes the FSNamesystem to the slaves
ZooKeeper works as a registry; clients and datanodes can look up the namenode list from it
DFSClient can access multiple namenodes for read operations
Failover is controlled by Linux-HA so far, which gets namenode status info from ClientProtocol
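
A minimal sketch of the lookup path described above. The `Registry` class is an in-memory stand-in for ZooKeeper with a made-up interface, and the `Client` class is hypothetical; sending writes only to the master and rotating reads round-robin are assumptions, since the slides state only that reads may go to multiple namenodes.

```python
import itertools


class Registry:
    """In-memory stand-in for the ZooKeeper registry (hypothetical interface)."""

    def __init__(self):
        self.master = None
        self.slaves = []

    def register(self, address, is_master=False):
        if is_master:
            self.master = address
        else:
            self.slaves.append(address)

    def namenodes(self):
        """Return the master followed by all registered slaves."""
        return [self.master] + list(self.slaves)


class Client:
    """Sends writes to the master and spreads reads over all namenodes."""

    def __init__(self, registry):
        self.registry = registry
        # Assumed policy: simple round-robin over the list seen at start-up.
        self._readers = itertools.cycle(registry.namenodes())

    def pick(self, operation):
        if operation == "write":
            return self.registry.master
        return next(self._readers)


registry = Registry()
registry.register("nn0:9000", is_master=True)
registry.register("nn1:9000")
registry.register("nn2:9000")

client = Client(registry)
reads = [client.pick("read") for _ in range(4)]
print(reads)                 # ['nn0:9000', 'nn1:9000', 'nn2:9000', 'nn0:9000']
print(client.pick("write"))  # nn0:9000
```

A real client would also have to refresh its list when the registry changes, for example after a failover; that part is omitted here.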

Update Events

NNU_NOP            // nothing to do
NNU_BLK            // add or remove a block
NNU_INODE          // add or remove or modify an inode (add or remove file; new block allocation)
NNU_NEWFILE        // start new file
NNU_CLSFILE        // close new file
NNU_MVRM           // move or remove file
NNU_MKDIR          // mkdir
NNU_LEASE          // add/update or release a lease
NNU_LEASE_BATCH    // update batch of leases
NNU_DNODEHB_BATCH  // batch of datanode heartbeat
NNU_DNODEREG       // dnode register
NNU_DNODEBLK       // block report
NNU_DNODERM        // remove dnode
NNU_BLKRECV        // block received message from datanode
NNU_REPLICAMON     // replication monitor work
NNU_WORLD          // bootstrap a slave node
NNU_MASSIVE        // bootstrap a slave node
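
How a slave might consume such events can be sketched as follows. Only the event names come from the list above; the payload fields (`path`, `src`, `dst`), the `SlaveNamespace` class, and the way each event changes state are assumptions made for illustration, not the NNC implementation.

```python
from enum import Enum, auto


class Update(Enum):
    # A subset of the event names listed above; payload shapes are assumed.
    NNU_NOP = auto()
    NNU_MKDIR = auto()
    NNU_NEWFILE = auto()
    NNU_CLSFILE = auto()
    NNU_MVRM = auto()


class SlaveNamespace:
    """Toy slave that replays the master's update events in order."""

    def __init__(self):
        self.entries = {"/": "dir"}

    def apply(self, event, payload):
        if event is Update.NNU_NOP:
            return
        if event is Update.NNU_MKDIR:
            self.entries[payload["path"]] = "dir"
        elif event is Update.NNU_NEWFILE:
            self.entries[payload["path"]] = "file (open)"
        elif event is Update.NNU_CLSFILE:
            self.entries[payload["path"]] = "file"
        elif event is Update.NNU_MVRM:
            entry = self.entries.pop(payload["src"])
            if payload.get("dst") is not None:  # no dst means a plain remove
                self.entries[payload["dst"]] = entry
        else:
            raise ValueError(event)


stream = [
    (Update.NNU_MKDIR, {"path": "/d"}),
    (Update.NNU_NEWFILE, {"path": "/d/f"}),
    (Update.NNU_CLSFILE, {"path": "/d/f"}),
    (Update.NNU_MVRM, {"src": "/d/f", "dst": "/d/g"}),
    (Update.NNU_NOP, {}),
]
slave = SlaveNamespace()
for event, payload in stream:
    slave.apply(event, payload)
print(slave.entries)  # {'/': 'dir', '/d': 'dir', '/d/g': 'file'}
```

Applying events strictly in the order the master emitted them is what keeps the slave's namespace identical to the master's in this sketch.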

Performance and Other Issues

The overhead of NameNode synchronization
  For typical file IO and MapReduce (sort, wordcount)
    The NNC system reaches 95% of the performance of Hadoop without NNC
  For metadata write-only operations (parallel touchz or mkdir)
    The NNC system reaches 15% of the performance of Hadoop without NNC
Performance gain from multiple NameNodes in read-only operations
  Not observed so far, unfortunately
Other design issues
  Why sync from master to slaves directly, without an additional delivery node?
    That may introduce another SPOF and make the problem more complex.
  Why not use ZooKeeper for failover?
    Linux-HA works well, and we are also evaluating whether to change to ZK. Any suggestions?

Q & A