45
www.edureka.co/big-data-and-hadoop Introduction to big data and hadoop

Introduction to Big Data & Hadoop

  • Upload
    edureka

  • View
    403

  • Download
    17

Embed Size (px)

Citation preview

Page 1: Introduction to Big Data & Hadoop

www.edureka.co/big-data-and-hadoop

Introduction to big data and hadoop

Page 2: Introduction to Big Data & Hadoop

Slide 2 www.edureka.co/big-data-and-hadoop

Page 3: Introduction to Big Data & Hadoop

Slide 3 www.edureka.co/big-data-and-hadoop

At the end of this session , you will understand the:

→ Big Data Learning Paths

→ Big Data Introduction

→ Hadoop and Its Eco-System

→ Hadoop Architecture

→ Next Step on How to Setup Hadoop

Page 4: Introduction to Big Data & Hadoop

Slide 4 www.edureka.co/big-data-and-hadoop

• Java / Python / Ruby• Hadoop Eco-system• NoSQL DB• Spark

• Linux Administration• Cluster Management• Cluster Performance• Virtualization

• Statistics Skills• Machine Learning• Hadoop Essentials • Expertise in R

Developer/Testing

Administration

Data Analyst

Big Data and Hadoop

MapReduce Design Patterns

Apache Spark & Scala

Apache Cassandra

Linux Administration Hadoop Administration

Data Science

Business Analytics Using R

Advance Predictive Modelling in R

Talend for Big Data

Data Visualization Using Tableau

Page 5: Introduction to Big Data & Hadoop

Slide 5 www.edureka.co/big-data-and-hadoop

→ Lots of Data (Terabytes or Petabytes)

→ Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

→ The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization Big Data

Page 6: Introduction to Big Data & Hadoop

Slide 6 www.edureka.co/big-data-and-hadoop

→ Systems / Enterprises generate huge amount of data from Terabytes to Petabytes of information

Stock market generates about one terabyte of new trade data per day to perform stock trading analytics to determine trends for optimal trades

Page 7: Introduction to Big Data & Hadoop

Slide 7 www.edureka.co/big-data-and-hadoop

→ By 2020, IDC (International Data Corporation) predicts the number will have reached 40,000 EB, or 40 Zettabytes (ZB)

→ The world’s information is doubling every two years. By 2020 the world will generate 50 times the amount of information and 75 times the number of information containers

Page 8: Introduction to Big Data & Hadoop

Slide 8 www.edureka.co/big-data-and-hadoop

IBM’s Definition – Big Data Characteristicshttp://www-01.ibm.com/software/data/bigdata/

VOLUME

Web logs

Images

Videos

Audios

Sensor Data

VARIETY

VELOCITY VERACITY

Min Max Mean SD

4.3 7.9 5.84 0.83

2.0 4.4 3.05 0.43

0.1 2.5 1.20 0.76

Page 9: Introduction to Big Data & Hadoop

Slide 9 www.edureka.co/big-data-and-hadoop

Hello There!!My name is Annie. I love quizzes and

puzzles and I am here to make you guys think and

answer my questions.

Page 10: Introduction to Big Data & Hadoop

Slide 10 www.edureka.co/big-data-and-hadoop

Map the following to corresponding data type:» XML files, e-mail body» Audio, Video, Images, Archived documents» Data from Enterprise systems (ERP, CRM etc.)

Page 11: Introduction to Big Data & Hadoop

Slide 11 www.edureka.co/big-data-and-hadoop

Ans. XML files, e-mail body → Semi-structured dataAudio, Video, Image, Files, Archived documents → Unstructured data Data from Enterprise systems (ERP, CRM etc.) → Structured data

Page 12: Introduction to Big Data & Hadoop

Slide 12 www.edureka.co/big-data-and-hadoop

More on Big Data

• http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop?

• http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop

• http://www.edureka.in/blog/jobs-in-hadoop/

Big Data

• http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics

• http://www-01.ibm.com/software/data/bigdata/

Page 13: Introduction to Big Data & Hadoop

Slide 13Slide 13Slide 13 www.edureka.co/big-data-and-hadoop

→ Web and e-tailing

» Recommendation Engines» Ad Targeting» Search Quality» Abuse and Click Fraud Detection

→ Telecommunications

» Customer Churn Prevention» Network Performance Optimization» Calling Data Record (CDR) Analysis» Analysing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Page 14: Introduction to Big Data & Hadoop

Slide 14Slide 14Slide 14 www.edureka.co/big-data-and-hadoop

→ Government

» Fraud Detection and Cyber Security» Welfare Schemes » Justice

→ Healthcare and Life Sciences

» Health Information Exchange» Gene Sequencing» Serialization» Healthcare Service Quality Improvements» Drug Safety

http://wiki.apache.org/hadoop/PoweredBy

Page 15: Introduction to Big Data & Hadoop

Slide 15Slide 15Slide 15 www.edureka.co/big-data-and-hadoop

→ Banks and Financial services

» Modeling True Risk» Threat Analysis» Fraud Detection» Trade Surveillance» Credit Scoring and Analysis

→ Retail

» Point of Sales Transaction Analysis» Customer Churn Analysis» Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Page 16: Introduction to Big Data & Hadoop

Slide 16Slide 16Slide 16 www.edureka.co/big-data-and-hadoop

→ Insight into data can provide Business Advantage.

→ Some key early indicators can mean Fortunes to Business.

→ More Precise Analysis with more data.

*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS etc., to store and process the customer activity and sales data.

Case Study: Sears Holding Corporation

Page 17: Introduction to Big Data & Hadoop

Slide 17Slide 17Slide 17 www.edureka.co/big-data-and-hadoop

Mostly Append

BI Reports + Interactive Apps

RDBMS (Aggregated Data)

ETL Compute Grid

Storage only Grid (Original Raw Data)

Collection

Inctrumentation

A meagre 10% of the

~2PB data is available for

BI

Storage

2. Moving data to compute doesn’t scale

90% of the ~2PB archived

Processing

3. Premature data death

1. Can’t explore original high fidelity raw data

Page 18: Introduction to Big Data & Hadoop

Slide 18Slide 18Slide 18 www.edureka.co/big-data-and-hadoop

Mostly Append

BI Reports + Interactive Apps

RDBMS (Aggregated Data)

Hadoop : Storage + Compute Grid

Collection

Instrumentation

Both Storage

And Processing

Entire ~2PB Data is

available for processing

No Data Archiving

1. Data Exploration & Advanced analytics

2. Scalable throughput for ETL & aggregation

3. Keep data alive forever

*Sears moved to a 300-Node Hadoop cluster to keep 100% of its data available for processing rather than a meagre 10% as was the case with existing Non-Hadoop solutions.

Page 19: Introduction to Big Data & Hadoop

Slide 19Slide 19Slide 19 www.edureka.co/big-data-and-hadoop

Read 1 TB Data

4 I/O ChannelsEach Channel – 100 MB/s

1 Machine4 I/O ChannelsEach Channel – 100 MB/s

10 Machine

Page 20: Introduction to Big Data & Hadoop

Slide 20Slide 20Slide 20 www.edureka.co/big-data-and-hadoop

4 I/O ChannelsEach Channel – 100 MB/s

1 Machine4 I/O ChannelsEach Channel – 100 MB/s

10 Machine

43 Minutes

Read 1 TB Data

Page 21: Introduction to Big Data & Hadoop

Slide 21Slide 21Slide 21 www.edureka.co/big-data-and-hadoop

4 I/O ChannelsEach Channel – 100 MB/s

1 Machine4 I/O ChannelsEach Channel – 100 MB/s

10 Machine

4.3 Minutes43 Minutes

Read 1 TB Data

Page 22: Introduction to Big Data & Hadoop

Slide 22Slide 22Slide 22 www.edureka.co/big-data-and-hadoop

→ Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

→ It is an Open-source Data Management with scale-out storage and distributed processing.

Page 23: Introduction to Big Data & Hadoop

Slide 23 www.edureka.co/big-data-and-hadoop

Hadoop is a framework that allows for the distributed processing of:» Small Data Sets» Large Data Sets

Page 24: Introduction to Big Data & Hadoop

Slide 24 www.edureka.co/big-data-and-hadoop

Ans. Large Data Sets. It is also capable of processing small data-sets. However, to experience the true power of Hadoop, one needs to have data in TB’s. Because this is where RDBMS takes hours and fails whereas Hadoop does the same in couple of minutes.

Page 25: Introduction to Big Data & Hadoop

Slide 25Slide 25Slide 25 www.edureka.co/big-data-and-hadoop

Pig LatinData Analysis

HiveDW System

Other YARNFrameworks (MPI, GRAPH)

HBaseMapReduce Framework

YARNCluster Resource Management

Apache Oozie (Workflow)

HDFS(Hadoop Distributed File System)

Hadoop 2.0

Sqoop

Unstructured or Semi-structured Data Structured Data

Flume

MahoutMachine Learning

Page 26: Introduction to Big Data & Hadoop

Slide 26Slide 26Slide 26 www.edureka.co/big-data-and-hadoop

DataNode

Node Manager

DataNode DataNode DataNode

YARN

HDFSCluster

Resource Manager

NameNode

Node Manager

Node Manager

Node Manager

Page 27: Introduction to Big Data & Hadoop

Slide 27 www.edureka.co/big-data-and-hadoop

RAM: 16GBHard disk: 6 x 2TBProcessor: Xenon with 2 coresEthernet: 3 x 10 GB/sOS: 64-bit CentOS

RAM: 16GBHard disk: 6 x 2TBProcessor: Xenon with 2 cores.Ethernet: 3 x 10 GB/sOS: 64-bit CentOS

RAM: 64 GB,Hard disk: 1 TBProcessor: Xenon with 8 CoresEthernet: 3 x 10 GB/sOS: 64-bit CentOSPower: Redundant Power Supply

RAM: 32 GB,Hard disk: 1 TBProcessor: Xenon with 4 CoresEthernet: 3 x 10 GB/sOS: 64-bit CentOSPower: Redundant Power Supply

Active NameNodeSecondary NameNode

DataNode DataNode

RAM: 64 GB,Hard disk: 1 TBProcessor: Xenon with 8 CoresEthernet: 3 x 10 GB/sOS: 64-bit CentOSPower: Redundant Power Supply

StandBy NameNode

Page 28: Introduction to Big Data & Hadoop

Slide 28 www.edureka.co/big-data-and-hadoop

Master

NameNodehttp://master:50070/

ResourceManagerhttp://master:8088

Slave01

DataNode

NodeManager

Slave02

DataNode

NodeManager

Slave03

DataNode

NodeManager

Slave04

DataNode

NodeManager

Slave05

DataNode

NodeManager

Page 29: Introduction to Big Data & Hadoop

Slide 29 www.edureka.co/big-data-and-hadoop

NodeManager

DataNode

NodeManager

HDFS YARN

NameNode

DataNode

NodeManager DataNode

ResourceManager

DataNode

NodeManager

DataNode

NodeManager

NodeManager

DataNode

NodeManager

DataNode

NodeManager

DataNode

Page 30: Introduction to Big Data & Hadoop

Slide 30 www.edureka.co/big-data-and-hadoop

Namenode NS

Storage

Nam

espa

ceBl

ock

Stor

age

Nam

espa

ce NN-1 NN-k NN-n

Common StorageBl

ock

Stor

age

Pool 1 Pool k Pool nBlock Pools

… …

Hadoop 1.0 Hadoop 2.0

DatanodeDatanode

Datanode 1 …

Datanode m …

Datanode 2 …

Block Management

Page 31: Introduction to Big Data & Hadoop

Slide 31 www.edureka.co/big-data-and-hadoop

How does HDFS Federation help HDFS Scale horizontally?a. Reduces the load on any single NameNode by using the multiple, independent NameNode to manage individual parts of the file system namespace.b. Provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.

Page 32: Introduction to Big Data & Hadoop

Slide 32 www.edureka.co/big-data-and-hadoop

Ans. Option (a) In order to scale the name service horizontally, HDFS federation uses multiple independent NameNode. The NameNode are federated, that is, the NameNode are independent and don’t require coordination with each other.

Page 33: Introduction to Big Data & Hadoop

Slide 33 www.edureka.co/big-data-and-hadoop

You have configured two name nodes to manage /marketing and /finance respectively. What will happen if you try to put a file in /accounting directory?

Page 34: Introduction to Big Data & Hadoop

Slide 34 www.edureka.co/big-data-and-hadoop

Ans. Put will fail. None of the namespace will manage the file and you will get an IOException with a No such file or directory error.

Page 35: Introduction to Big Data & Hadoop

Slide 35 www.edureka.co/big-data-and-hadoop

Node Manager

ContainerApp

Master

Node Manager

ContainerApp

Master

HDFS YARN

Resource Manager

All name space edits logged to shared NFS storage; single writer

(fencing)

Read edit logs and applies to its own namespace

Secondary Name Node

DataNode

Standby NameNode

Active NameNode

DataNode Data Node

DataNode

DataNode

NameNode High Availability

Next Generation MapReduce

*Not necessary to configure Secondary NameNode

Client

Shared Edit Logs

HDFS HIGH AVAILABILITY

Node Manager

ContainerApp

Master

Node Manager

ContainerApp

Master

Page 36: Introduction to Big Data & Hadoop

Slide 36 www.edureka.co/big-data-and-hadoop

Node Manager

ContainerApp

Master

Node Manager

ContainerApp

Master

HDFS YARN

Resource Manager

All name space edits logged to shared NFS storage; single writer

(fencing)

Read edit logs and applies to its own namespace

Secondary Name Node

DataNode

Standby NameNode

Active NameNode

DataNode Data Node

DataNode

DataNode

NameNode High Availability

Next Generation MapReduce

*Not necessary to configure Secondary NameNode

Client

Shared Edit Logs

HDFS HIGH AVAILABILITY

Node Manager

ContainerApp

Master

Node Manager

ContainerApp

Master

Page 37: Introduction to Big Data & Hadoop

Slide 37 www.edureka.co/big-data-and-hadoop

HDFS HA was developed to overcome the following disadvantage in Hadoop 1.0?a. Single Point of Failure of Name-Nodeb. Only one version can be run in classic Map-Reducec. Too much burden on Job Tracker

Page 38: Introduction to Big Data & Hadoop

Slide 38 www.edureka.co/big-data-and-hadoop

Ans. Single Point of Failure of NameNode

Page 40: Introduction to Big Data & Hadoop

Slide 40 www.edureka.co/big-data-and-hadoop

Facebook

→ We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.

→ Currently we have 2 major clusters:

» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.

» A 300-machine cluster with 2400 cores and about 3 PB raw storage.

» Each (commodity) node has 8 cores and 12 TB of storage.

» We are heavy users of both streaming as well as the Java APIs. We have built

a higher level data warehousing framework using these features called Hive

(see the http://Hadoop.apache.org/hive/). We have also developed a FUSE

implementation over HDFS.

Page 41: Introduction to Big Data & Hadoop

Slide 41 www.edureka.co/big-data-and-hadoop

Hadoop can run in any of the following three modes:

Fully-Distributed Mode

Pseudo-Distributed Mode

→ No daemons, everything runs in a single JVM.→ Suitable for running MapReduce programs during development.→ Has no DFS.

→ Hadoop daemons run on the local machine.

→ Hadoop daemons run on a cluster of machines.

Standalone (or Local) Mode

Page 42: Introduction to Big Data & Hadoop

Slide 42 www.edureka.co/big-data-and-hadoop

→ Apache Hadoop and HDFS

→ http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

→ Apache Hadoop HDFS Architecture

→ http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Page 43: Introduction to Big Data & Hadoop

Slide 43 www.edureka.co/big-data-and-hadoop

• Referring the documents present in the LMS under assignment solve the below problem.

How many such DataNodes you would need to read 100TB data in 5 minutes in your Hadoop Cluster?

Page 44: Introduction to Big Data & Hadoop

Slide 44

Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better!

Please spare few minutes to take the survey after the webinar.

www.edureka.co/big-data-and-hadoop

Page 45: Introduction to Big Data & Hadoop