HDFS introduction

++Hadoop Basic Course

++Overview

[Figure: Hadoop ecosystem components: HDFS, Impala, MapReduce, Cascading, Hive]

++Big Data for What?

Service

CAP Theorem, Fast Response, Scale Out, Schema Free ...

Distributor with RDBMS

MongoDB, HBase, CouchDB ...

Analysis

Hadoop <--- today’s topic!!!

++What’s Hadoop

Consists of

HDFS (Hadoop Distributed File System)

MapReduce

++HDFS Architecture

master

namenode

a bunch of datanodes

[Figure: one NameNode and five DataNodes]

++single master

Strong Point

simple architecture

the master has global knowledge.

file and block namespace (memory and disk)

mapping from files to blocks (memory and disk)

location of each block’s replicas (memory only)

the master can make sophisticated decisions.
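The three kinds of metadata listed above can be pictured as three tables. This is only a minimal sketch: the class name `NameNodeMetadata` and its methods are hypothetical and are not the real NameNode data structures. It assumes the memory-only table is filled in from datanode block reports.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeMetadata:
    # Kept in memory and on disk: the namespace and the file -> blocks mapping.
    namespace: set = field(default_factory=set)
    file_to_blocks: dict = field(default_factory=dict)
    # Kept in memory only: block -> datanodes, learned from datanode reports.
    block_locations: dict = field(default_factory=dict)

    def create_file(self, path, block_ids):
        self.namespace.add(path)
        self.file_to_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        # A datanode tells the master which blocks it holds.
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        # Global knowledge: for each block of the file, where the replicas are.
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_to_blocks[path]
        ]


meta = NameNodeMetadata()
meta.create_file("/foo/bar.txt", ["blk_1", "blk_2"])
meta.block_report("dn1", ["blk_1", "blk_2"])
meta.block_report("dn2", ["blk_1"])
print(meta.locate("/foo/bar.txt"))
```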

++single master

Weak Point

SPOF (= single point of failure)

bottleneck

minimizing the master’s involvement is important

++Fast Recovery for NameNode

Secondary Namenode

crawls the namenode’s operation log

maintains a copy of the namenode’s data

[Figure: NameNode and Secondary NameNode]
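Keeping a copy of the namenode's data from its operation log could look roughly like the replay below. This is a sketch under assumptions: the log format (tuples of operation, path, blocks) and the function name `apply_edits` are made up for illustration and do not reflect the real on-disk format.

```python
def apply_edits(checkpoint, edit_log):
    """Replay logged operations on a copy of a checkpointed namespace."""
    namespace = dict(checkpoint)  # leave the original checkpoint untouched
    for op, path, blocks in edit_log:
        if op == "create":
            namespace[path] = list(blocks)
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


checkpoint = {"/a": ["blk_1"]}
edit_log = [
    ("create", "/b", ["blk_2", "blk_3"]),
    ("delete", "/a", []),
]
print(apply_edits(checkpoint, edit_log))
```

With an up-to-date copy like this, a restarted namenode only has to replay the part of the log written after the last checkpoint.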

++HA for NameNode

active namenode

performs the normal namenode operations

standby namenode

maintains a copy of the namenode’s data

ready to become the active namenode

[Figure: NameNode (active) and NameNode (standby)]

++block

each file consists of blocks

size

default 64 MB (128 MB in Hadoop 2.x and later)

replication ( default 3 )
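A quick arithmetic sketch of what block size and replication mean for one file. The helper name `block_layout` is hypothetical; the 64 MB and 3 are the defaults quoted on this slide.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide
REPLICATION = 3                # default replication factor


def block_layout(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, number of replicas, raw bytes stored)."""
    num_blocks = math.ceil(file_size / block_size)
    return num_blocks, num_blocks * replication, file_size * replication


one_gb = 1024 ** 3
print(block_layout(one_gb))  # (16, 48, 3221225472)
```

So a 1 GB file becomes 16 blocks, 48 block replicas across the cluster, and 3 GB of raw disk usage.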

++write operation

the client sends a ‘write request’ to the namenode.

the namenode locks the file and selects the datanodes to write to.

the namenode returns the datanode list to the client.

the client sends the file content to a datanode.

the datanode stores the content and relays it to the other datanodes.

finally, the client sends a close request to the namenode.

the namenode releases the write lock.

[Figure: client and three DataNodes; label: write lock & allocate datanode]
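The write steps above can be sketched as a small in-memory simulation. All class and function names here are hypothetical, whole files are written at once instead of block by block, and the namenode simply picks the first three datanodes. It is a toy model of the flow, not the real protocol.

```python
class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.write_locked = set()

    def open_for_write(self, path):
        # Take the write lock and allocate datanodes for the file.
        if path in self.write_locked:
            raise RuntimeError(f"{path} is already being written")
        self.write_locked.add(path)
        return self.datanodes[: self.replication]

    def close(self, path):
        # Release the write lock.
        self.write_locked.discard(path)


class DataNode:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, path, data, downstream):
        # Store locally, then relay to the next datanode in the list.
        self.store[path] = data
        if downstream:
            downstream[0].write(path, data, downstream[1:])


def client_write(namenode, path, data):
    pipeline = namenode.open_for_write(path)
    try:
        pipeline[0].write(path, data, pipeline[1:])
    finally:
        namenode.close(path)
    return [dn.name for dn in pipeline]


datanodes = [DataNode(f"dn{i}") for i in range(1, 5)]
namenode = NameNode(datanodes)
print(client_write(namenode, "/foo/bar.txt", b"hello"))
```

Note that the file content never passes through the namenode; it only hands out the lock and the datanode list.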

++read operation

the client sends a ‘read request’ to the namenode.

the namenode locks the file and selects the datanodes to read from.

the namenode returns the datanode list to the client.

the client sends a read request to a datanode.

the datanode sends the content to the client.

finally, the client sends a close request to the namenode.

the namenode releases the read lock.

[Figure: client and three DataNodes; label: read lock]
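The read steps above, as a similar toy sketch. The names are hypothetical, datanodes are modeled as plain dictionaries, and the fallback to the next datanode when one has no copy is an assumption added for illustration.

```python
class NameNode:
    def __init__(self, locations):
        self.locations = locations  # path -> list of datanode names
        self.readers = {}           # path -> number of open readers

    def open_for_read(self, path):
        candidates = self.locations[path]
        self.readers[path] = self.readers.get(path, 0) + 1  # take read lock
        return candidates

    def close(self, path):
        self.readers[path] -= 1  # release read lock


def client_read(namenode, datanodes, path):
    """datanodes maps a datanode name to its local {path: bytes} store."""
    candidates = namenode.open_for_read(path)
    try:
        for name in candidates:
            data = datanodes[name].get(path)
            if data is not None:
                return data
        raise IOError(f"no datanode holds {path}")
    finally:
        namenode.close(path)


datanodes = {"dn1": {}, "dn2": {"/foo/bar.txt": b"hello"}}
namenode = NameNode({"/foo/bar.txt": ["dn1", "dn2"]})
print(client_read(namenode, datanodes, "/foo/bar.txt"))
```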

++block(again)

reasons to use a big block size

reduce the client’s need to interact with the namenode

reduce the size of metadata stored on namenode
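To get a feel for the metadata argument, compare how many block entries the namenode would track for the same data at two block sizes. The 4 KB figure is just an illustrative small block size, chosen here as an assumption for the comparison.

```python
def block_entries(total_bytes, block_size):
    """Number of block entries needed for total_bytes (integer ceiling)."""
    return -(-total_bytes // block_size)


one_tb = 1024 ** 4
big = block_entries(one_tb, 64 * 1024 ** 2)  # 64 MB blocks
small = block_entries(one_tb, 4 * 1024)      # 4 KB blocks
print(big, small, small // big)  # 16384 268435456 16384
```

With 64 MB blocks the namenode tracks 16384 times fewer entries for the same terabyte, and a client reading it needs correspondingly fewer block lookups.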

++namenode’s operation

namespace management and locking

replica placement

creation, re-replication, rebalancing

garbage collection

stale replica detection

++namespace management and locking

goal

ensure proper serialization

uses read locks and write locks
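A minimal sketch of the read lock / write lock rule for one path: many readers or one writer, never both. `PathLock` is a hypothetical name, and this version returns False instead of blocking, so it shows the rule only and is not a thread-safe lock.

```python
class PathLock:
    def __init__(self):
        self.readers = 0
        self.writer = False

    def acquire_read(self):
        # Readers may share the lock unless a writer holds it.
        if self.writer:
            return False
        self.readers += 1
        return True

    def release_read(self):
        self.readers -= 1

    def acquire_write(self):
        # A writer needs exclusive access: no readers, no other writer.
        if self.writer or self.readers:
            return False
        self.writer = True
        return True

    def release_write(self):
        self.writer = False


lock = PathLock()
print(lock.acquire_read(), lock.acquire_read(), lock.acquire_write())
```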

++block replica placement

maximize data reliability and availability

maximize network bandwidth utilization

default strategy is ...

one on the same datanode as the writer.

one on another datanode in the same rack.

one on another datanode in a different rack.
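The strategy as listed on this slide can be sketched as below. The function name and the rack topology dictionary are hypothetical, the sketch assumes the writer runs on a datanode and that each rack has spare nodes, and the policy in a given Hadoop release may differ from the slide.

```python
def place_replicas(writer, topology):
    """Pick three datanodes for a block.

    topology maps a rack name to the list of datanode names in that rack.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Second replica: another datanode in the writer's rack.
    same_rack = next(node for node in topology[local_rack] if node != writer)
    # Third replica: any datanode in a different rack.
    other_rack = next(
        node
        for rack, nodes in topology.items()
        if rack != local_rack
        for node in nodes
    )
    return [writer, same_rack, other_rack]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn2", topology))  # ['dn2', 'dn1', 'dn3']
```

The local copy keeps the write cheap, while the copy in another rack keeps the block available if a whole rack goes down.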

++creation, re-replication, rebalancing

creation

a client creates a new file

consider

disk space utilization

number of recent creations

spread replicas

re-replication

the number of available replicas falls below the replication goal

datanode down, replica corruption ...

rebalancing

move replicas for better disk space and load balancing
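The re-replication trigger can be sketched as a scan for blocks whose live replica count is below the goal. The function name and the input shapes are assumptions for illustration.

```python
def under_replicated(block_locations, live_datanodes, target=3):
    """Return {block_id: number of missing replicas} for blocks below target.

    block_locations maps a block id to the datanodes holding a replica;
    live_datanodes is the set of datanodes currently considered alive.
    """
    missing = {}
    for block_id, holders in block_locations.items():
        live = [dn for dn in holders if dn in live_datanodes]
        if len(live) < target:
            missing[block_id] = target - len(live)
    return missing


block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
# dn4 is down, so blk_2 has only two live replicas left.
print(under_replicated(block_locations, {"dn1", "dn2", "dn3"}))  # {'blk_2': 1}
```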

++garbage collection

what’s garbage?

a block that is not in the namenode’s metadata.

mechanism

when exchanging heartbeats with the namenode, a datanode reports a subset of the blocks it has.

the master (namenode) replies with the blocks that are garbage.

the datanode deletes the garbage blocks.
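The mechanism above, as a short sketch. The function names are hypothetical, and for simplicity the datanode reports all of its blocks in one heartbeat rather than a subset.

```python
def heartbeat_reply(known_blocks, reported_blocks):
    """Master side: reported blocks that are not in the metadata are garbage."""
    return sorted(set(reported_blocks) - set(known_blocks))


def delete_garbage(local_blocks, garbage):
    """Datanode side: drop the blocks the master identified as garbage."""
    for block_id in garbage:
        local_blocks.pop(block_id, None)
    return local_blocks


known = {"blk_1", "blk_2"}
local = {"blk_1": b"a", "blk_2": b"b", "blk_9": b"orphan"}
garbage = heartbeat_reply(known, local.keys())
print(garbage)                                 # ['blk_9']
print(sorted(delete_garbage(local, garbage)))  # ['blk_1', 'blk_2']
```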

++stale replica detection

mechanism

each replica is stored with a generation timestamp.

when restarting, a datanode reports its set of blocks with their generation timestamps.
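Detecting stale replicas from such a report could look like this. It is a sketch that assumes the namenode keeps the latest generation timestamp per block and that a replica with an older value missed an update while its datanode was down; the function name is hypothetical.

```python
def stale_replicas(expected, report):
    """Return the reported blocks whose generation timestamp is out of date.

    expected: block id -> latest generation timestamp known to the namenode
    report:   block id -> generation timestamp held by the restarting datanode
    """
    return sorted(
        block_id
        for block_id, stamp in report.items()
        if block_id in expected and stamp < expected[block_id]
    )


expected = {"blk_1": 5, "blk_2": 3}
report = {"blk_1": 4, "blk_2": 3}
print(stale_replicas(expected, report))  # ['blk_1']
```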

++Datanode’s operation

check data integrity

datanodes use checksumming to detect corruption.
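A minimal sketch of checksum-based corruption detection. It uses `zlib.crc32` over 512-byte chunks as a stand-in; the chunk size, the checksum algorithm, and the function names are assumptions for illustration and do not describe the exact datanode implementation.

```python
import zlib

CHUNK = 512  # bytes covered by one checksum (assumed for this sketch)


def checksums(data, chunk=CHUNK):
    """Compute one CRC32 per fixed-size chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def corrupt_chunks(data, stored):
    """Return the indexes of chunks whose checksum no longer matches."""
    return [
        index
        for index, (actual, expected) in enumerate(zip(checksums(data), stored))
        if actual != expected
    ]


data = bytes(range(256)) * 5        # 1280 bytes, so three chunks
stored = checksums(data)            # saved when the block was written
damaged = bytearray(data)
damaged[600] ^= 0xFF                # flip one byte inside the second chunk
print(corrupt_chunks(bytes(damaged), stored))  # [1]
```

Checksumming per chunk rather than per block narrows down where the damage is.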

++filesystem api

hdfs provides basic Linux-style file utilities. ex)

hdfs dfs -mkdir -p /foo

hdfs dfs -ls /foo

hdfs dfs -cat /foo/bar.txt

hdfs dfs -rm -r /foo

native library?

thanks ....
