++Hadoop Basic Course
++Overview
HDFS
Impala
MapReduce
Cascading
Hive
++Big Data for What?
Service
CAP Theorem, Fast Response, Scale Out, Schema Free ...
Distributor with RDBMS
NoSQL
MongoDB, HBase, CouchDB ...
Analysis
Hadoop <--- today’s topic!!!
++What’s Hadoop
Consists of:
HDFS (Hadoop Distributed File System)
MapReduce
++HDFS Architecture
master: namenode
slave: a group of datanodes
[Diagram: one NameNode connected to five DataNodes]
++single master
Strong Point
simple architecture
master has global knowledge:
file and block namespace (memory and disk)
mapping from files to blocks (memory and disk)
location of each block’s replicas (memory only)
master can make sophisticated decisions.
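The three kinds of metadata listed above can be pictured as a small data structure. This is only a sketch: the class `NameNodeMetadata` and its method names are hypothetical and do not correspond to the real NameNode code. It shows which mappings would be persisted and which live only in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class NameNodeMetadata:
    """Hypothetical model of the master's global knowledge."""

    # file path -> ordered block ids (kept in memory and on disk)
    file_to_blocks: Dict[str, List[int]] = field(default_factory=dict)
    # block id -> datanodes holding a replica (memory only)
    block_locations: Dict[int, Set[str]] = field(default_factory=dict)

    def add_file(self, path: str, block_ids: List[int]) -> None:
        self.file_to_blocks[path] = list(block_ids)

    def report_block(self, datanode: str, block_id: int) -> None:
        # Replica locations are learned from datanode reports.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path: str) -> List[Tuple[int, List[str]]]:
        # For each block of the file, list the datanodes that hold it.
        return [
            (block_id, sorted(self.block_locations.get(block_id, set())))
            for block_id in self.file_to_blocks[path]
        ]

    def persistent_state(self) -> dict:
        # Only the namespace and file-to-block mapping would be written to disk.
        return {"file_to_blocks": dict(self.file_to_blocks)}
```

Because replica locations are not persisted in this model, they would have to be rebuilt from datanode reports after a restart.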
++single master
Weak Point
SPOF (= single point of failure)
bottleneck
minimizing the master’s involvement is important
++Fast Recovery for NameNode
Secondary Namenode
crawls the namenode’s operation log
maintains a copy of the namenode’s data
[Diagram: NameNode with five DataNodes, plus a Secondary NameNode]
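One way to picture "crawling the operation log" is replaying logged operations on top of an older snapshot of the namespace. The sketch below assumes a made-up log format of `(kind, path)` tuples; the function names `apply_operation` and `checkpoint` are hypothetical and are not the Secondary NameNode's real interface.

```python
def apply_operation(namespace, op):
    """Apply one logged operation to a set of paths (hypothetical log format)."""
    kind, path = op
    if kind in ("mkdir", "create"):
        namespace.add(path)
    elif kind == "delete":
        namespace.discard(path)
    else:
        raise ValueError("unknown operation: " + kind)


def checkpoint(snapshot, operation_log):
    """Return a new namespace: the old snapshot plus every logged operation."""
    namespace = set(snapshot)  # copy, so the old snapshot stays intact
    for op in operation_log:
        apply_operation(namespace, op)
    return namespace
```

With an up-to-date checkpoint available, a restarted namenode would only need to replay the operations logged after it, which is what makes recovery fast.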
++HA for NameNode
active namenode
performs normal namenode operations
standby namenode
maintains the namenode’s data
ready to become the active namenode
[Diagram: active NameNode and standby NameNode with five DataNodes]
++block
each file consists of blocks
size: default 64 MB (Hadoop 1.x; 128 MB from Hadoop 2.x)
replication: default 3
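As a quick worked example of these defaults, the sketch below splits a file size into block-sized pieces and computes the raw storage used once every block is replicated. The helper names are hypothetical; the 64 MB block size and replication factor 3 are the slide's defaults.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB   # default block size from the slide
REPLICATION = 3        # default replication factor from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs; only the last block may be shorter."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def raw_storage(file_size, replication=REPLICATION):
    """Bytes consumed across the cluster when every block is replicated."""
    return file_size * replication


blocks = split_into_blocks(200 * MB)
print(len(blocks), blocks[-1][1] // MB, raw_storage(200 * MB) // MB)
```

For a 200 MB file this gives three full 64 MB blocks plus one 8 MB block, and 600 MB of raw storage.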
++write operation
client sends a ‘write request’ to the namenode.
namenode locks the file and selects the datanodes to write to.
namenode returns the datanode list to the client.
client sends the file content to a datanode.
datanode stores the content and relays it to the other datanodes.
finally, client sends a close request to the namenode.
namenode releases the write lock.
[Diagram: client, NameNode (write lock & allocate datanode), three DataNodes]
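The write steps can be sketched as a tiny in-memory simulation. Everything here is an assumption made for illustration: the classes `NameNode` and `DataNode` and the function `client_write` are hypothetical and do not mirror the real HDFS client API, and the namenode simply picks the first three datanodes.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.stored = {}  # path -> content

    def write(self, path, content, pipeline):
        # Store locally, then relay to the next datanode in the pipeline.
        self.stored[path] = content
        if pipeline:
            pipeline[0].write(path, content, pipeline[1:])


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.write_locked = set()

    def open_for_write(self, path):
        # Lock the file and select the datanodes to write to.
        if path in self.write_locked:
            raise RuntimeError("file is already being written: " + path)
        self.write_locked.add(path)
        return self.datanodes[: self.replication]

    def close(self, path):
        # Release the write lock.
        self.write_locked.discard(path)


def client_write(namenode, path, content):
    targets = namenode.open_for_write(path)
    try:
        # The client talks only to the first datanode; the rest are relayed.
        targets[0].write(path, content, targets[1:])
    finally:
        namenode.close(path)
    return [dn.name for dn in targets]
```

Note that the file content never passes through the namenode, which keeps the master's involvement small.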
++read operation
client sends a ‘read request’ to the namenode.
namenode locks the file and selects the datanodes to read from.
namenode returns the datanode list to the client.
client sends a read request to a datanode.
datanode sends the content to the client.
finally, client sends a close request to the namenode.
namenode releases the read lock.
[Diagram: client, NameNode (read lock), three DataNodes]
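The read steps can be sketched in the same spirit. The names below (`NameNode`, `client_read`, the `live_datanodes` dictionary) are hypothetical, and the fallback to the next replica when a datanode is unavailable is an assumption added to show why the namenode returns a list rather than a single datanode.

```python
from collections import Counter


class NameNode:
    def __init__(self, locations):
        self.locations = locations   # path -> list of datanode names
        self.read_locks = Counter()  # path -> number of active readers

    def open_for_read(self, path):
        # Take a read lock and return the datanodes holding the file.
        self.read_locks[path] += 1
        return list(self.locations[path])

    def close(self, path):
        # Release the read lock.
        self.read_locks[path] -= 1


def client_read(namenode, live_datanodes, path):
    """live_datanodes maps a datanode name to its {path: content} store."""
    candidates = namenode.open_for_read(path)
    try:
        for name in candidates:
            store = live_datanodes.get(name)
            if store is not None and path in store:
                return store[path]  # content comes straight from the datanode
        raise IOError("no live replica for " + path)
    finally:
        namenode.close(path)
```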
++block(again)
reasons to use a big block size:
reduces the client’s need to interact with the namenode
reduces the size of metadata stored on the namenode
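A rough calculation makes the metadata argument concrete. The sketch below counts how many block entries the namenode would need to track for 1 TB of data at two block sizes; the 4 KB comparison value is an assumption chosen to resemble a typical local filesystem block, and `block_entries` is a hypothetical helper.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB


def block_entries(total_bytes, block_size):
    """Number of blocks needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // block_size)


small = block_entries(TB, 4 * KB)    # 268,435,456 entries
large = block_entries(TB, 64 * MB)   # 16,384 entries
print(small, large, small // large)  # the large block size needs 16,384x fewer entries
```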
++namenode’s operation
namespace management and locking
replica placement
creation, re-replication, rebalancing
garbage collection
stale replica detection
++namespace management and locking
goal: ensure proper serialization
use read locks and write locks
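The read lock / write lock rule can be sketched as a per-path lock table: many readers may share a path, while a writer needs it exclusively. This is a single-threaded illustration with hypothetical names (`LockTable` and its methods), not the namenode's actual locking code.

```python
class LockTable:
    """Per-path read/write locks: shared readers, exclusive writer."""

    def __init__(self):
        self.readers = {}     # path -> number of active readers
        self.writers = set()  # paths currently held by a writer

    def acquire_read(self, path):
        if path in self.writers:
            return False  # a writer holds the path
        self.readers[path] = self.readers.get(path, 0) + 1
        return True

    def acquire_write(self, path):
        if path in self.writers or self.readers.get(path, 0) > 0:
            return False  # someone else is reading or writing
        self.writers.add(path)
        return True

    def release_read(self, path):
        if self.readers.get(path, 0) > 0:
            self.readers[path] -= 1

    def release_write(self, path):
        self.writers.discard(path)
```

Locking per path rather than globally lets operations on unrelated files proceed without waiting for each other.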
++block replica placement
goal
maximize data reliability and availability
maximize network bandwidth utilization
default strategy (with replication 3):
one on the writer’s own datanode.
one on another datanode in the same rack.
one on another datanode in another rack.
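The strategy as listed on this slide can be sketched as a small selection function. The function name `place_replicas` and the `topology` mapping are hypothetical, and picking the first candidate by name is a simplification; a real placement policy would also weigh load and free space, and the exact default may differ between Hadoop versions.

```python
def place_replicas(writer, topology):
    """Pick three datanodes for a block.

    topology maps a datanode name to its rack name (hypothetical format).
    """
    local_rack = topology[writer]
    same_rack = sorted(
        node for node, rack in topology.items()
        if rack == local_rack and node != writer
    )
    other_rack = sorted(
        node for node, rack in topology.items() if rack != local_rack
    )
    if not same_rack or not other_rack:
        raise ValueError("need another node in the local rack and one in another rack")
    # 1) the writer's own datanode, 2) same rack, 3) a different rack
    return [writer, same_rack[0], other_rack[0]]
```

Keeping two replicas in one rack saves cross-rack bandwidth, while the third replica in another rack protects against the loss of a whole rack.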
++creation, re-replication, rebalancing
creation
triggered when a client creates new files
when choosing datanodes, consider:
disk space utilization
number of recent creations
spreading replicas
re-replication
triggered when the number of available replicas falls below the replication goal
causes: datanode down, replica corruption ...
rebalancing
move replicas for better disk space and load balancing
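The re-replication trigger can be sketched as a scan over the block-to-datanode map. The function `under_replicated` is hypothetical, and ordering blocks with the most missing replicas first is an assumption about a sensible priority, not something stated on the slide.

```python
def under_replicated(block_locations, live_datanodes, goal=3):
    """Return (block_id, missing_count) pairs, most urgent first.

    block_locations maps a block id to the set of datanodes holding it;
    live_datanodes is the set of datanodes currently alive.
    """
    needs_work = []
    for block_id, holders in block_locations.items():
        live = len(holders & live_datanodes)  # replicas on dead nodes do not count
        if live < goal:
            needs_work.append((block_id, goal - live))
    # Blocks missing the most replicas are closest to being lost.
    needs_work.sort(key=lambda item: (-item[1], item[0]))
    return needs_work
```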
++garbage collection
what’s garbage?
a block that is not in the namenode’s metadata.
mechanism
when exchanging heartbeats with the namenode, a datanode reports a subset of the blocks it has.
namenode replies with the garbage blocks.
datanode deletes the garbage blocks.
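This exchange boils down to a set difference. In the sketch below, the function names and the use of plain sets of block ids are assumptions made for illustration.

```python
def garbage_blocks(reported_blocks, known_blocks):
    """Namenode side: reported blocks that are absent from its metadata."""
    return set(reported_blocks) - set(known_blocks)


def heartbeat(local_blocks, known_blocks):
    """Datanode side: report blocks, then delete whatever the namenode flags."""
    garbage = garbage_blocks(local_blocks, known_blocks)
    local_blocks.difference_update(garbage)  # delete the garbage blocks
    return garbage
```

For example, a datanode holding blocks `{1, 2, 7}` while the namenode knows only `{1, 2, 3}` would be told that block `7` is garbage and would end up holding `{1, 2}`.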
++stale replica detection
mechanism
each block is stored with a generation timestamp.
when restarting, a datanode reports its set of blocks with their generation timestamps.
a replica whose timestamp is older than the one the namenode holds is stale.
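The comparison can be sketched as follows, assuming the namenode keeps the latest generation timestamp per block and the datanode reports the timestamp of each replica it holds. The function name and the dictionary formats are hypothetical.

```python
def stale_replicas(current_stamps, reported_stamps):
    """Return block ids whose reported replica is older than the namenode's record.

    current_stamps:  block id -> latest generation timestamp known to the namenode
    reported_stamps: block id -> generation timestamp of the datanode's replica
    """
    return sorted(
        block_id
        for block_id, stamp in reported_stamps.items()
        if block_id in current_stamps and stamp < current_stamps[block_id]
    )
```

A datanode that was down while a block was updated would report an old timestamp for it, so its replica would be flagged here and could then be treated like any other missing replica.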
++Datanode’s operation
check data integrity
datanodes use checksumming to detect corruption.
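A minimal sketch of checksum-based corruption detection: compute a CRC per fixed-size chunk when the data is stored, then recompute and compare when it is read. The 512-byte chunk size and the use of `zlib.crc32` are assumptions for illustration, and the helper names are hypothetical.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum (assumed value)


def checksums(data, chunk=CHUNK):
    """One CRC32 value per chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def first_corrupt_chunk(data, expected, chunk=CHUNK):
    """Index of the first chunk that fails verification, or None if all match."""
    actual = checksums(data, chunk)
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    if len(actual) != len(expected):
        # Data was truncated or extended relative to the stored checksums.
        return min(len(actual), len(expected))
    return None
```

Checksumming per chunk instead of per block means a reader can verify just the part it reads and learn roughly where the damage is.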
++filesystem api
hdfs provides basic linux-style file utilities. ex)
hdfs dfs -mkdir -p /foo
hdfs dfs -ls /foo
hdfs dfs -cat /foo/bar.txt
hdfs dfs -rm -r /foo
++etc
raid?
native library?
++end
thanks ....