
Hadoop, HBase, MapReduce


A brief description of Hadoop, HDFS, MapReduce, Hive and Pig.


Page 1: Hadoop hbase mapreduce
Page 2: Hadoop hbase mapreduce

What is Big Data ?

● How big is “Big Data”?
● Are 30–40 terabytes big data?
● ….

● Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools

● Today: terabytes, petabytes, exabytes
● Tomorrow?

Page 3: Hadoop hbase mapreduce

Enterprises & Big Data

● Most companies are currently using traditional tools to store data

● Big data: The next frontier for innovation, competition, and productivity

● The use of big data will become a key basis of competition

● Organisations across the globe need to take the rising importance of big data more seriously

Page 4: Hadoop hbase mapreduce

Hadoop is an ecosystem, not a single product.

When you deal with BigData, the data center is your computer.

Page 5: Hadoop hbase mapreduce

• A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem

Page 6: Hadoop hbase mapreduce

• Hadoop has its origins in Apache Nutch

• Nutch was started in 2002

• Challenge: how to handle the billions of pages on the Web?

• 2003 – Google publishes the GFS (Google File System) paper

• 2004 – NDFS (Nutch Distributed File System)

• 2004 – Google publishes the MapReduce paper

• 2005 – The Nutch developers start implementing MapReduce

A Brief History of Hadoop

Page 7: Hadoop hbase mapreduce

• A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem

Page 8: Hadoop hbase mapreduce

Contributors and Development

Lifetime patches contributed to all Hadoop-related projects, by community members' current employer. *Source: JIRA tickets

Page 9: Hadoop hbase mapreduce

Contributors and Development

Page 10: Hadoop hbase mapreduce

Contributors and Development

*Source: Kerberos Konference (Yahoo), 2010

Page 11: Hadoop hbase mapreduce

Development in ASF/Hadoop

● Resources:
● Mailing lists
● Wiki pages, blogs
● Issue tracking – JIRA
● Version control – SVN / Git

Page 12: Hadoop hbase mapreduce

• A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem

Page 13: Hadoop hbase mapreduce

What is Hadoop

• Open-source project administered by the ASF

• Data-intensive storage and massively parallel processing (MPP)

• Enables applications to work with thousands of nodes and petabytes of data

• Suitable for applications with large data sets

Page 14: Hadoop hbase mapreduce

What is Hadoop ?

• Scalable

• Fault tolerant

• Reliable data storage using the Hadoop Distributed File System (HDFS)

• High-performance parallel data processing using a technique called MapReduce

Page 15: Hadoop hbase mapreduce

What is Hadoop ?

• Hadoop is becoming the de facto standard for large-scale data processing

• Becoming more than just MapReduce

• The ecosystem is growing rapidly, with lots of great tools around it

Page 16: Hadoop hbase mapreduce

What is Hadoop ?

Yahoo Hadoop Cluster

SGI Hadoop Cluster

38,000 machines distributed across 20 different clusters. Source: Yahoo, 2010

50,000 machines as of January 2012. Source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/

Page 17: Hadoop hbase mapreduce

• A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem

Page 18: Hadoop hbase mapreduce

Why Hadoop?

Page 19: Hadoop hbase mapreduce

Why Hadoop?

Page 20: Hadoop hbase mapreduce

Why Hadoop?

Page 21: Hadoop hbase mapreduce

Why Hadoop?

• Can process Big Data (petabytes and more)

• Virtually unlimited data storage and analysis

• No licence cost – Apache License 2.0

• Can be built out of commodity hardware

• IT Cost Reduction

• Results

• Be One Step Ahead of Competition

• Stay there

Page 22: Hadoop hbase mapreduce

Is Hadoop an alternative to RDBMSs?

• At the moment Apache Hadoop is not a substitute for a database

• No relations (not a relational data model)

• Key-value pairs

• Big Data:

  • unstructured (text)

  • semi-structured (sequence / binary files)

  • structured (HBase, modeled after Google BigTable)

• Works fine together with RDBMSs

Page 23: Hadoop hbase mapreduce

• A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem

Page 24: Hadoop hbase mapreduce

Hadoop Ecosystem

(diagram) HDFS (Hadoop Distributed File System) at the base; HBase (key-value store) and MapReduce (job scheduling/execution system) on top of it; Pig (data flow) and Hive (SQL) on top of MapReduce; BI reporting and ETL tools at the user-facing layer; Sqoop connecting Hadoop to external RDBMSs.

Page 25: Hadoop hbase mapreduce

Hadoop Ecosystem

Important components of Hadoop

• HDFS: a distributed, fault-tolerant file system

• MapReduce: a parallel data processing framework

• Hive: a query framework (SQL-like)

• Pig: a query scripting tool

• HBase: real-time read/write access to your Big Data

Page 26: Hadoop hbase mapreduce

Hadoop Ecosystem

Hadoop is a distributed data computing platform.

Page 27: Hadoop hbase mapreduce

HDFS

Page 28: Hadoop hbase mapreduce

HDFS

NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes store the blocks themselves and constantly report to the NameNode to keep the metadata current.
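To illustrate this division of labour, the sketch below (not from the original deck) asks the NameNode for the block locations of a file through the standard Hadoop FileSystem client API; the file path is hypothetical and fs.defaultFS is assumed to point at the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);                 // client for the configured HDFS
    FileStatus status = fs.getFileStatus(new Path("/user/demo/file.txt")); // hypothetical path
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      // each block reports the DataNodes that hold a replica
      System.out.println(b.getOffset() + " -> " + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}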

Page 29: Hadoop hbase mapreduce

Hadoop Cluster

Page 30: Hadoop hbase mapreduce

Writing Files To HDFS

• Client consults the NameNode

• Client writes the block directly to one DataNode

• That DataNode replicates the block to others

• The cycle repeats for the next block (a minimal client-side sketch follows below)
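A minimal sketch of the client side of this write path, assuming the Hadoop client libraries are on the classpath; the target path is hypothetical, and replication to further DataNodes happens behind the scenes according to dfs.replication.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt");          // hypothetical target path
    try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite if it exists
      out.write("hello hdfs\n".getBytes("UTF-8"));         // data is split into blocks and replicated
    }
    fs.close();
  }
}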

Page 31: Hadoop hbase mapreduce

Reading Files From HDFS

• Client consults the NameNode

• Client receives a DataNode list for each block

• Client picks the first DataNode for each block

• Client reads blocks sequentially (a read-path sketch follows below)
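A matching read-path sketch under the same assumptions: open() consults the NameNode for block locations, and the returned stream then reads each block from a nearby DataNode. The path mirrors the write example above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/user/demo/hello.txt"))))) { // hypothetical path
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);                           // each block is fetched from a DataNode
      }
    }
    fs.close();
  }
}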

Page 32: Hadoop hbase mapreduce

Rack Awareness & Fault Tolerance

(diagram) The NameNode keeps rack-aware metadata, for example:

Rack 1: DN1, DN2, DN3, DN5   Rack 5: DN5, DN6, DN7, DN8   …   Rack N

File.txt – Blk A: DN1, DN5, DN6; Blk B: DN1, DN2, DN9; Blk C: DN5, DN9, DN10

• Never lose all data if an entire rack fails

• In-rack communication has higher bandwidth and lower latency

Page 33: Hadoop hbase mapreduce

Cluster Health

Page 34: Hadoop hbase mapreduce

Hadoop Ecosystem

Important components of Hadoop

• HDFS: a distributed, fault-tolerant file system

• MapReduce: a parallel data processing framework

• Hive: a query framework (SQL-like)

• Pig: a query scripting tool

• HBase: a column-oriented database for OLTP-style workloads

Page 35: Hadoop hbase mapreduce

MapReduce-Paradigm

• Simplified Data Processing on Large Clusters

• Splitting a big problem / big data into little pieces

• Key-value pairs

Page 36: Hadoop hbase mapreduce

MapReduce-Batch Processing

• Phases (illustrated by the word-count sketch after this list):

• Map

• Sort/Shuffle

• Reduce (Aggregation)

• Coordination

• JobTracker

• TaskTracker
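The canonical word-count example below is a minimal sketch (not taken from the deck) using the org.apache.hadoop.mapreduce API; it shows what the Map and Reduce phases look like in code, with the framework performing the sort/shuffle between them.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);            // shuffled and sorted by word before reduce
      }
    }
  }

  // Reduce: sum the 1s that arrive grouped per word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}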

Page 37: Hadoop hbase mapreduce

MapReduce-Map

(diagram) Map tasks running on Datanode 1, Datanode 2 and Datanode 3 each scan their local data and emit key/value pairs of the form (K, 1).

Page 38: Hadoop hbase mapreduce

MapReduce-Sort/Shuffle

(diagram) The emitted (K, 1) pairs are sorted and shuffled across Datanode 1–3 so that all values for the same key end up on the same node.

Page 39: Hadoop hbase mapreduce

MapReduce-Reduce

(diagram) After the sort/shuffle, reduce tasks on Datanode 1–3 aggregate the grouped 1s into per-key totals (e.g. 4, 3, 3, 2) and emit final (K, V) pairs.

Page 40: Hadoop hbase mapreduce

MapReduce-All Phases

(diagram) The complete pipeline: Map emits (K, 1) pairs, Sort/Shuffle groups them by key, and Reduce aggregates them into the final counts (4, 3, 3, 2).

Page 41: Hadoop hbase mapreduce

MapReduce-Job & Task Tracker

JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster
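A minimal driver sketch for submitting such a job from a client, reusing the word-count Mapper and Reducer sketched earlier; Job.getInstance is the Hadoop 2.x form of the API (older releases used new Job(conf, name)). Input and output paths are passed as program arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenMapper.class);        // classes from the earlier sketch
    job.setReducerClass(WordCount.SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of text files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait for completion
  }
}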


Page 42: Hadoop hbase mapreduce

Summary of HDFS and MR

Page 43: Hadoop hbase mapreduce

Hadoop Ecosystem

Important components of Hadoop

• HDFS: a distributed, fault-tolerant file system

• MapReduce: a parallel data processing framework

• Hive: a query framework (SQL-like)

• Pig: a query scripting tool

• HBase: a column-oriented database for OLTP-style workloads

Page 44: Hadoop hbase mapreduce

Hive

Page 45: Hadoop hbase mapreduce

Hive

• Data warehousing package built on top of Hadoop

• Began its life at Facebook, processing large amounts of user and log data

• Hadoop subproject with many contributors

• Ad hoc queries, summarization, and data analysis on Hadoop-scale data

• Directly queries data in different data formats (text/binary) and file formats (flat/sequence)

• HiveQL – an SQL-like query language (a short example follows below)
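A short HiveQL session, sketched in the same CLI style as the Pig and HBase examples later in the deck (not part of the original slides). It assumes the same sample file used in the Pig example, with a hypothetical table definition; each statement is compiled into one or more MapReduce jobs over data in HDFS.

hive> CREATE TABLE records (year STRING, temperature INT, quality INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
    > OVERWRITE INTO TABLE records;

hive> SELECT year, MAX(temperature)
    > FROM records
    > WHERE quality = 1
    > GROUP BY year;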

Page 46: Hadoop hbase mapreduce

Hive Components

(diagram) The Hive CLI, a management web UI, and the Thrift API accept DDL, queries, and browsing requests; the HiveQL parser and planner compile queries for execution as MapReduce jobs over HDFS, and the MetaStore holds table metadata.

*Thrift: Interface Definition Language

Page 47: Hadoop hbase mapreduce

Hadoop Ecosystem

Important components of Hadoop

• HDFS: a distributed, fault-tolerant file system

• MapReduce: a parallel data processing framework

• Hive: a query framework (SQL-like)

• Pig: a query scripting tool

• HBase: a column-oriented database for OLTP-style workloads

Page 48: Hadoop hbase mapreduce

Pig

• The language used to express data flows is called Pig Latin

• Pig Latin can be extended using UDFs (User-Defined Functions) – a sketch follows the grunt session below

• Pig was originally developed at Yahoo! Research

• PigPen is an Eclipse plug-in that provides an environment for developing Pig programs

• Running Pig programs:

• Script – a script file that contains Pig commands

• Grunt – an interactive shell

• Embedded – from Java programs

Page 49: Hadoop hbase mapreduce

Pig

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);

grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

grunt> DESCRIBE records;
records: {year: chararray, temperature: int, quality: int}

grunt> filtered_records = FILTER records BY temperature != 22;
grunt> DUMP filtered_records;

grunt> grouped_records = GROUP records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
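To illustrate the UDF extension point mentioned above, here is a hypothetical Java eval function (name and behaviour invented for the example) that converts an integer field such as the temperature above from tenths of a degree to degrees. After packaging it into a jar it could be used from grunt with REGISTER and a FOREACH ... GENERATE statement.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: divide an integer field by 10.
public class TenthsToDegrees extends EvalFunc<Double> {
  @Override
  public Double exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;                        // propagate missing values as null
    }
    int tenths = (Integer) input.get(0);  // e.g. the temperature field from sample.txt
    return tenths / 10.0;
  }
}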

Page 50: Hadoop hbase mapreduce

Hadoop Ecosystem

Important components of Hadoop

• HDFS: a distributed, fault-tolerant file system

• MapReduce: a parallel data processing framework

• Hive: a query framework (SQL-like)

• Pig: a query scripting tool

• HBase: a column-oriented database for OLTP-style workloads

Page 51: Hadoop hbase mapreduce

HBase

• Random, real-time read/write access to your Big Data

• Billions of rows × millions of columns

• Column-oriented store modeled after Google's BigTable

• Provides Bigtable-like capabilities on top of Hadoop and HDFS

• HBase is not a column-oriented database in the typical RDBMS sense, but uses an on-disk column storage format

Page 52: Hadoop hbase mapreduce

HBase-Datamodel

• Think of tags: values of any length, no predefined names or widths

• Column names carry information (just like tags)

• (Table, RowKey, Family, Column, Timestamp) → Value

Page 53: Hadoop hbase mapreduce

HBase-Datamodel

• (Table, RowKey, Family, Column, Timestamp) → Value

Page 54: Hadoop hbase mapreduce

HBase-Datamodel

• (Table, RowKey, Family, Column, Timestamp) → Value

Page 55: Hadoop hbase mapreduce

Create Sample Table

hbase(main):003:0> create 'test', 'cf'

hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'

hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'

hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'

hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'

hbase(main):007:0> scan 'test'

ROW COLUMN+CELL

row1 column=cf:a, timestamp=1288380727188, value=value12

row2 column=cf:b, timestamp=1288380738440, value=value2

row3 column=cf:c, timestamp=1288380747365, value=value3

hbase(main):007:0> scan 'test', { VERSIONS => 3 }

ROW COLUMN+CELL

row1 column=cf:a, timestamp=1288380727188, value=value12

row1 column=cf:a, timestamp=1288380727188, value=value11

row2 column=cf:b, timestamp=1288380738440, value=value2

row3 column=cf:c, timestamp=1288380747365, value=value3
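For comparison, below is a rough Java-client sketch of the same operations as the shell session above (using the 0.9x-era HBase client API; newer releases use Connection/Table instead of HTable). The table name, family, and values mirror the shell example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TestTableClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml
    HTable table = new HTable(conf, "test");

    // (Table, RowKey, Family, Column) -> Value, as in the data model above
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("value12"));
    table.put(put);

    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"))));

    table.close();
  }
}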

Page 56: Hadoop hbase mapreduce

HBase-Architecture

• Splits

• Auto-Sharding

• Master

• Region Servers

• HFile

Page 57: Hadoop hbase mapreduce

Splits & RegionServers

• Rows are grouped into regions and served by different servers
• A table is dynamically split into “regions”
• Each region contains row keys in the range [startKey, endKey)
• Each region is hosted on a RegionServer

Page 58: Hadoop hbase mapreduce

HBase-Architecture

Page 59: Hadoop hbase mapreduce

Other Components

• Flume – collects and moves large volumes of log/streaming data into HDFS

• Sqoop – transfers bulk data between Hadoop and relational databases

Page 60: Hadoop hbase mapreduce

Commercial Products

• Oracle Big Data Appliance

• Microsoft Azure + Excel + MapReduce

• Cloud computing – Amazon elastic computing

• IBM Hadoop-based InfoSphere BigInsights

• VMWare Spring for Apache Hadoop

• Toad for Cloud Database

• MapR, Cloudera, Hortonworks, Datameer

Page 61: Hadoop hbase mapreduce

Thank You

Faruk Berksö[email protected]