Big Data: HDFS (2nd presentation)


Presentation on Big Data: HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Presented by: TAKRIM UL ISLAM LASKAR (120103006)

Who uses Hadoop?

• Amazon/A9 • Facebook • Google • IBM: Blue Cloud? • Joost • Last.fm • New York Times • PowerSet • Veoh • Yahoo!

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Highly fault tolerant

High throughput

Suitable for applications with large data sets

Streaming access to file system data

Can be built out of commodity hardware

Commodity Hardware

Typically a 2-level architecture:

– Nodes are commodity PCs

– Uplink from the rack is 3–4 gigabit

– Rack-internal is 1 gigabit

Why DFS?

Reading 1 TB of data:

• 1 machine, 4 I/O channels, each 100 MB/s – about 45 minutes

• 10 machines, 4 I/O channels each, 100 MB/s – about 4.5 minutes
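The numbers above follow from simple throughput arithmetic; a quick sketch (the slide's figures are rounded up from the exact result):

```python
# Back-of-the-envelope check of the slide's read times (illustrative only).
TB = 10**12              # bytes in 1 TB (decimal)
CHANNEL = 100 * 10**6    # 100 MB/s per I/O channel
CHANNELS = 4             # channels per machine

def read_minutes(machines):
    """Minutes to scan 1 TB striped evenly across `machines` machines."""
    bandwidth = machines * CHANNELS * CHANNEL   # aggregate bytes/second
    return TB / bandwidth / 60

print(round(read_minutes(1), 1))    # ~41.7 min; the slide rounds to 45
print(round(read_minutes(10), 1))   # ~4.2 min; the slide rounds to 4.5
```

Ten machines give ten times the aggregate bandwidth, so the read time drops by the same factor: this is the whole case for a distributed file system.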

Goals of HDFS

• Very Large Distributed File System

– 10K nodes, 100 million files, 10 PB

• Assumes Commodity Hardware

– Files are replicated to handle hardware failure

– Detects failures and recovers from them

• Optimized for Batch Processing

– Data locations exposed so that computations can move to where data resides

– Provides very high aggregate bandwidth

HDFS Architecture

Functions of a NameNode

• Manages File System Namespace

– Maps a file name to a set of blocks

– Maps a block to the DataNodes where it resides

• Cluster Configuration Management

• Replication Engine for Blocks

NameNode Metadata

• Meta-data in Memory

– The entire metadata is in main memory (RAM)

• Types of Metadata

– List of files

– List of Blocks for each file

– List of DataNodes for each block

– File attributes, e.g. creation time, replication factor

• A Transaction Log

– Records file creations, file deletions, etc.
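The two mappings and the transaction log above can be sketched as a toy model (the names here are illustrative, not Hadoop's real data structures):

```python
# Toy model of the NameNode's in-memory metadata (illustrative names only).
file_to_blocks = {}   # file name -> ordered list of block ids
block_to_nodes = {}   # block id  -> DataNodes holding a replica
edit_log = []         # transaction log: records namespace mutations

def create_file(name, block_ids):
    """Record a new file in the namespace and log the mutation."""
    file_to_blocks[name] = list(block_ids)
    edit_log.append(("create", name, list(block_ids)))

def place_replica(block_id, datanode):
    """Record that `datanode` now holds a replica of `block_id`."""
    block_to_nodes.setdefault(block_id, []).append(datanode)

create_file("/foodir/myfile.txt", ["blk_1", "blk_2"])
place_replica("blk_1", "dn-a")
print(file_to_blocks["/foodir/myfile.txt"])   # ['blk_1', 'blk_2']
```

Because all of this lives in RAM, lookups are fast, but the metadata must be made durable via the transaction log and periodic checkpoints (see the Secondary NameNode slide).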

DataNode

• A Block Server

– Stores data in the local file system (e.g. ext3)

– Stores meta-data of a block

– Serves data and meta-data to Clients

• Block Report

– Periodically sends a report of all existing blocks to the NameNode

• Facilitates Pipelining of Data

– Forwards data to other specified DataNodes
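A block report is just the DataNode's current inventory, sent so the NameNode can rebuild its block-to-node map; a minimal sketch (class and field names are assumptions):

```python
# Sketch of a DataNode's periodic block report (illustrative, not Hadoop's API).
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = set()          # block ids stored on this node

    def block_report(self):
        """Inventory sent periodically so the NameNode can rebuild
        its block -> DataNode mapping after a restart."""
        return {"node": self.name, "blocks": sorted(self.blocks)}

dn = DataNode("dn-a")
dn.blocks.update({"blk_2", "blk_1"})
print(dn.block_report())   # {'node': 'dn-a', 'blocks': ['blk_1', 'blk_2']}
```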

Block Placement

• Current Strategy

– One replica on the local node

– Second replica on a remote rack

– Third replica on the same remote rack

– Additional replicas are randomly placed

• Clients read from nearest replica.
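The placement strategy above can be sketched as follows (the rack-naming scheme and helper are assumptions; real HDFS uses a pluggable placement policy):

```python
# Sketch of the default placement policy described above (illustrative).
import random

def rack_of(node):
    return node.split("/")[0]          # e.g. "rack1/dn-a" -> "rack1"

def place_replicas(local, nodes):
    by_rack = {}
    for m in nodes:
        by_rack.setdefault(rack_of(m), []).append(m)
    # 2nd and 3rd replicas share one remote rack, so pick a remote rack
    # that has at least two nodes.
    remote_racks = [r for r, ms in by_rack.items()
                    if r != rack_of(local) and len(ms) >= 2]
    rack = random.choice(remote_racks)
    second, third = random.sample(by_rack[rack], 2)
    return [local, second, third]      # 1st replica stays on the local node

nodes = ["rack1/dn-a", "rack1/dn-b", "rack2/dn-c", "rack2/dn-d", "rack3/dn-e"]
picks = place_replicas("rack1/dn-a", nodes)
print(picks)
```

One local replica keeps writes fast; putting the other two on a single remote rack survives a whole-rack failure while using only one cross-rack transfer.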

Heartbeats

• DataNodes send heartbeat to the NameNode

– Once every 3 seconds

• NameNode uses heartbeats to detect DataNode failure
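Failure detection amounts to tracking the last heartbeat per node; a sketch (the 3 s interval is from the slide, but the dead timeout here is an assumption; real HDFS defaults to roughly 10 minutes):

```python
# Sketch: the NameNode marks a DataNode dead when heartbeats stop arriving.
HEARTBEAT_INTERVAL = 3.0                 # seconds, from the slide
DEAD_AFTER = 10 * HEARTBEAT_INTERVAL     # assumed grace period (30 s)

last_heartbeat = {}   # datanode name -> timestamp of its last heartbeat

def record_heartbeat(node, now):
    last_heartbeat[node] = now

def dead_nodes(now):
    """Nodes silent for longer than the grace period are presumed dead."""
    return [n for n, t in last_heartbeat.items() if now - t > DEAD_AFTER]

record_heartbeat("dn-a", now=0.0)
record_heartbeat("dn-b", now=28.0)
print(dead_nodes(now=31.0))   # ['dn-a'] -- 31 s of silence > 30 s timeout
```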

Replication Engine

• NameNode detects DataNode failures

– Chooses new DataNodes for new replicas

– Balances disk usage

– Balances communication traffic to DataNodes
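Once a failure is detected, the replication engine must bring every affected block back to its target replica count; a sketch that also reflects the "balances disk usage" point by preferring the least-full surviving node (structures and names are assumptions):

```python
# Sketch: re-replicate under-replicated blocks after a DataNode failure,
# preferring the least-loaded surviving node (illustrative only).
def re_replicate(block_to_nodes, live_nodes, disk_used, target=3):
    new_targets = {}
    for blk, holders in block_to_nodes.items():
        holders = [n for n in holders if n in live_nodes]   # drop dead nodes
        missing = target - len(holders)
        candidates = sorted((n for n in live_nodes if n not in holders),
                            key=lambda n: disk_used[n])     # least-full first
        new_targets[blk] = candidates[:missing]
    return new_targets

b2n = {"blk_1": ["dn-a", "dn-b", "dn-c"]}
live = {"dn-b", "dn-c", "dn-d", "dn-e"}        # dn-a has died
used = {"dn-b": 70, "dn-c": 40, "dn-d": 20, "dn-e": 50}   # % disk used
result = re_replicate(b2n, live, used)
print(result)   # {'blk_1': ['dn-d']} -- the least-full candidate
```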

HDFS – 2nd Generation

• NameNode high availability

– Active NameNode

– Passive (standby) NameNode

Data Pipelining

• Client retrieves a list of DataNodes on which to place replicas of a block

• Client writes block to the first DataNode

• The first DataNode forwards the data to the next DataNode in the Pipeline

• When all replicas are written, the Client moves on to write the next block in the file
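The four steps above can be sketched as a chain of forwards: the client talks only to the first DataNode, and each node stores the block and passes it downstream (the structures here are illustrative):

```python
# Sketch of the write pipeline described above (illustrative only).
def write_block(block, pipeline, storage):
    """pipeline: ordered DataNode names; storage: node -> list of blocks."""
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    storage[head].append(block)          # store the block locally...
    write_block(block, rest, storage)    # ...then forward to the next node

storage = {"dn-a": [], "dn-b": [], "dn-c": []}
for blk in ["blk_1", "blk_2"]:           # client moves on block by block
    write_block(blk, ["dn-a", "dn-b", "dn-c"], storage)
print(storage["dn-c"])   # ['blk_1', 'blk_2'] -- every replica got both blocks
```

Pipelining means the client's outbound bandwidth is spent once per block, not once per replica.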

Secondary NameNode

• Copies FsImage and Transaction Log from NameNode to a temporary directory

• Merges FsImage and Transaction Log into a new FsImage in the temporary directory

• Uploads the new FsImage to the NameNode

– Transaction Log on NameNode is purged
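The checkpoint cycle above boils down to replaying the log into the image and starting a fresh log; a sketch with illustrative structures (real FsImage/edit-log formats are binary and richer than this):

```python
# Sketch of the checkpoint cycle: fold the transaction log into the image,
# then return a purged log (illustrative structures only).
def checkpoint(fsimage, edit_log):
    new_image = dict(fsimage)
    for op, path in edit_log:            # replay logged namespace mutations
        if op == "create":
            new_image[path] = []         # new file with no blocks yet
        elif op == "delete":
            new_image.pop(path, None)
    return new_image, []                 # new FsImage, purged Transaction Log

fsimage = {"/a": []}
edit_log = [("create", "/b"), ("delete", "/a")]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(sorted(fsimage), edit_log)   # ['/b'] []
```

Offloading this merge keeps the (busy) primary NameNode from having to replay an ever-growing log at startup.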

User Interface

• Commands for the HDFS User:

– hadoop dfs -mkdir /foodir

– hadoop dfs -cat /foodir/myfile.txt

– hadoop dfs -rm /foodir/myfile.txt

• Commands for the HDFS Administrator:

– hadoop dfsadmin -report

– hadoop dfsadmin -decommission datanodename

• Web Interface

– http://host:port/dfshealth.jsp

Hadoop Map/Reduce

• The Map-Reduce programming model

– Framework for distributed processing of large data sets

– Pluggable user code runs in a generic framework

• Common design pattern in data processing:

cat * | grep | sort | uniq -c | cat > file

input | map | shuffle | reduce | output

• Natural for:

– Log processing

– Web search indexing

– Ad-hoc queries
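The input | map | shuffle | reduce | output pipeline above can be shown with the canonical word-count example (plain Python here, not the Hadoop Java API):

```python
# The map | shuffle | reduce pattern, as a word count (illustrative only).
from collections import defaultdict

def map_phase(lines):
    for line in lines:                  # map: emit (word, 1) pairs
        for word in line.split():
            yield word, 1

def shuffle(pairs):                     # shuffle: group values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):               # reduce: sum each key's group
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big files"])))
print(counts)   # {'big': 2, 'data': 1, 'files': 1}
```

In Hadoop, the map and reduce functions are the pluggable user code; the framework supplies the distributed shuffle, scheduling, and fault tolerance.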

References:

1. YouTube lecture videos on the channel 'Training on Big Data and Hadoop' by user 'Edureka'.

2. 'White Book of Big Data' by Fujitsu.

3. 'Big Data For Dummies' (a Wiley brand).

4. Research paper by Kalapriya Kannan, IBM Research Labs.

5. Collected data from slideshare.net.

Useful Links

• HDFS Design:

– http://hadoop.apache.org/core/docs/current/hdfs_design.html

• Hadoop Information

– http://hadoop.apache.org/

• Hadoop API:

– http://hadoop.apache.org/core/docs/current/api/

– THANK YOU. I appreciate your patience.
