Taming Big Data with Hadoop
Kul Subedi

Hadoop-2.6.0 Slides




Introduction

● What is Big Data?


Properties of Big Data


Cont...

● Large and growing data files
● Commonly measured in terabytes or petabytes
● Unstructured data
● May not fit in a “relational database model”
● Derived from users, applications, systems, and sensors


Problem: Data Throughput Mismatch

● Standard spinning hard drive: 60-120 MB/sec
● Standard solid-state drive: 250-500 MB/sec
● Hard drive capacity keeps growing
● Online data keeps growing

Moving data on and off disk is the bottleneck.


Cont...

● One terabyte (TB) of data takes 10,000 seconds (approximately 167 minutes) to read at 100 MB/sec (spinning disk)
● One TB takes 2,000 seconds (approximately 33 minutes) to read at 500 MB/sec (solid state)

One TB is a “small” file size. Parallel data access is essential for Big Data.
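These figures are simply capacity divided by throughput; a quick back-of-the-envelope check in Python, assuming decimal units (1 TB = 1,000,000 MB) and taking 100 MB/sec as a representative spinning-disk rate:

  # Time to read 1 TB sequentially at a given sustained throughput
  TB_IN_MB = 1000000

  for label, mb_per_sec in [("spinning disk", 100), ("solid state", 500)]:
      seconds = TB_IN_MB / mb_per_sec
      print("%s: %d seconds (~%d minutes)" % (label, seconds, round(seconds / 60.0)))

  # spinning disk: 10000 seconds (~167 minutes)
  # solid state: 2000 seconds (~33 minutes)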


Problem (1): Scaling


Agenda

● Hadoop Definition
● Hadoop Ecosystem
● History
● Hadoop Design Principles
● HDFS and MapReduce (Demo)
● Conclusion


Definition

● A framework of open-source tools, libraries, and methodologies for the distributed processing of large data sets

● Scale up from single servers to thousands of machines, each offering local computation and storage


Cont...

● The project includes:
❏ Hadoop Common
❏ HDFS
❏ YARN
❏ MapReduce


Cont...

● Other Hadoop-related projects:
❏ Pig
❏ Hive
❏ Tez
❏ Spark
❏ HBase
❏ Ambari, etc.


Hadoop Usage Modes

● Administrators
❏ Installation
❏ Monitor/manage the system
❏ Tune the system
● End Users
❏ Design MapReduce applications
❏ Import/export data
❏ Work with various Hadoop tools


Hadoop History

● Developed by Doug Cutting and Michael J. Cafarella
● Based on Google's MapReduce technology
● Designed to handle large amounts of data and be robust
● Donated to the Apache Software Foundation in 2006 by Yahoo!


Cont...

Application areas:
❏ Social media
❏ Retail
❏ Financial services
❏ Web search
❏ Anywhere there are large amounts of unstructured data


Cont...

Prominent users:
❏ Yahoo!
❏ Facebook
❏ Amazon
❏ eBay
❏ American Airlines
❏ The New York Times, and many others


Design Principles

● Moving computation is cheaper than moving data
● Hardware will fail; manage it
● Hide execution details from the user
● Use streaming data access
● Use a simple file system coherency model


Cont...

What Hadoop is not: a replacement for SQL, always fast and efficient, or a good fit for quick ad-hoc queries


HDFS


HDFS Architecture

1. Client to NameNode: Where do I read or write data?
2. NameNode to client: Use these DataNodes.


NameNode

● Only one per cluster, the “master node”
● Stores filesystem metadata such as filenames, permissions, directories, and blocks
● Keeps it in RAM for fast access
● Persists it to disk
● The NameNode is the brain of the outfit


DataNode

● Many per cluster, the “slave nodes”
● Stores individual file “blocks” but knows nothing about them except the block name
● Reports regularly to the NameNode: “Hey, I am alive, and I have these blocks”


HDFS Presents

● Transparency
● Replication


HDFS Properties

● Files are immutable
❏ No updates, no appends
● Disk access is optimized for sequential reads
❏ Data is stored in large “blocks”, 128 MB by default
● Avoid corruption
❏ “Blocks” are verified with a checksum when stored and read
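To make the block model concrete, here is a small illustrative calculation using the 128 MB default from the slide (the helper below is just an example, not an HDFS API):

  # How many 128 MB blocks a file of a given size occupies in HDFS
  import math

  BLOCK_SIZE_MB = 128  # HDFS default block size

  def num_blocks(file_size_mb):
      return int(math.ceil(float(file_size_mb) / BLOCK_SIZE_MB))

  print(num_blocks(1024))  # a 1 GB file -> 8 blocks
  print(num_blocks(300))   # a 300 MB file -> 3 blocks, the last only partially full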


Cont...

● High throughput
❏ Avoid contention; have the system share as little information and as few resources as possible
● Fault tolerant
❏ Loss of a disk, a machine, or a rack of machines should not lead to data loss


Client Reading From HDFS


Client Writing To HDFS


Demo

● File system health check
❏ $ hdfs fsck /
● List the file system content
❏ $ hdfs dfs -ls /
● Create a directory
❏ $ hdfs dfs -mkdir /data1
● Upload a file to HDFS
❏ $ hdfs dfs -put input.txt /in
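Two follow-up commands, not on the original slide but standard hdfs dfs subcommands, confirm the upload:

● Verify the file arrived
❏ $ hdfs dfs -ls /in
● Print its contents
❏ $ hdfs dfs -cat /in/input.txt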


Cont...

● Input directory in HDFS: /in
● Output directory in HDFS: /output


NameNode High Availability

● NFS filer
● Quorum Journal Manager (QJM)


MapReduce

● A programming model for processing large data sets in a distributed fashion over a cluster of commodity machines
● Introduced by Google
● Uses two key steps: mapping and reducing


Cont...

● Almost all data can be mapped into <key, value> pairs somehow
● Keys and values may be of any type: strings, integers, dummy types, even <K,V> pairs themselves, and so on
● Scale-free programming
❏ If a program works for a 1 KB file, it can work for any file size


MapReduce Word Count Data Flow

Input, broken into InputSplits (one per line):
  see spot run
  run spot run
  see the cat

Map (each split emits a <word,1> pair per word):
  see,1 spot,1 run,1
  run,1 spot,1 run,1
  see,1 the,1 cat,1

Shuffle (pairs are grouped by key):
  see,1 see,1
  spot,1 spot,1
  run,1 run,1 run,1
  the,1
  cat,1

Reduce (the counts in each group are summed):
  see,2
  spot,2
  run,3
  the,1
  cat,1

Output:
  see,2 spot,2 run,3 the,1 cat,1


Example: Hello World Program

● wget www.gutenberg.org/files/2600/2600.txt
● python mapper.py < input.txt | sort | python reducer.py
● cat *.txt | python mapper.py | sort | python reducer.py
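The deck does not show the script sources; a minimal word-count pair in the usual Hadoop Streaming style could look like the sketch below (an illustrative assumption, not necessarily the author's exact code):

  #!/usr/bin/env python
  # mapper.py: emit "word<TAB>1" for every word read from stdin
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print("%s\t%s" % (word, 1))

  #!/usr/bin/env python
  # reducer.py: sum the counts for each word; relies on the input being sorted by key
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t", 1)
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print("%s\t%d" % (current_word, current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      print("%s\t%d" % (current_word, current_count))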


Demo

● cat input1.txt | python mapper.py
● cat input1.txt | python mapper.py | sort
● cat input1.txt | python mapper.py | sort | python reducer.py
● cat input1.txt | python mapper.py | sort | python reducer.py > output.txt


How to Run the Job on a Cluster?

● Using the Streaming Interface
❏ hadoop jar /opt/hadoop/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /in/input.txt -output /output/run1
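The job writes its results into the -output directory as part files; assuming the conventional part-file naming, they can be inspected with:

❏ $ hdfs dfs -cat /output/run1/part-00000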


Web Interfaces

● NameNode: http://10.0.0.160:50070
● ResourceManager: http://10.0.0.160:8088
● The web UI files live under /opt/hadoop/hadoop-2.6.0/share/hadoop/hdfs/webapps/hdfs


Prerequisites

● Java
● ssh
● rsync


Installation

● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz

● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz.mds


Cont...

● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz.mds


Integrity Check

● md5sum hadoop-2.6.0.tar.gz
● cat hadoop-2.6.0.tar.gz.mds | grep -i md5

The digest printed by md5sum should match the MD5 line from the .mds file.
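The same check can be scripted in Python if md5sum is unavailable (an illustrative snippet, not from the slides):

  # Compute the MD5 digest of the downloaded tarball
  import hashlib

  h = hashlib.md5()
  with open("hadoop-2.6.0.tar.gz", "rb") as f:
      for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
          h.update(chunk)
  print(h.hexdigest())  # compare against the MD5 line in the .mds file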


Startup Code

● https://github.com/kpsubedi/BigData


Open Source

● Apache Hadoop http://hadoop.apache.org/


Commercial Big Data Players

● Hortonworks http://hortonworks.com/
● Cloudera http://www.cloudera.com/content/cloudera/en/home.html
● MapR https://www.mapr.com/
● Others


Conclusion

● Thank you


References

● Apache Hadoop http://hadoop.apache.org/

● Hortonworks http://hortonworks.com/

● Cloudera http://www.cloudera.com/content/cloudera/en/downloads.html

● MapR https://www.mapr.com/

● The Google File System http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf


References (1)

● MapReduce: Simplified Data Processing on Large Clusters http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf