www.edureka.co/big-data-and-hadoop
Introduction to Big Data and Hadoop
Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this session, you will understand:
Big Data Learning Paths
Big Data Introduction
Hadoop and Its Eco-System
Hadoop Architecture
Next Steps: How to Set Up Hadoop
Slide 3 www.edureka.co/big-data-and-hadoop
Big Data Learning Path
Developer/Testing
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark

Administration
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization

Data Analyst
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R
Big Data and Hadoop
MapReduceDesign Patterns
Apache Spark & Scala
Apache Cassandra
Linux Administration Hadoop Administration
Data Science
Business Analytics Using R
Advanced Predictive Modelling in R
Talend for Big Data
Data Visualization Using Tableau
Slide 4 www.edureka.co/big-data-and-hadoop
Unstructured Data is Exploding
Source: Twitter
Slide 5 www.edureka.co/big-data-and-hadoop
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
IBM’s Definition of Big Data
Slide 6 www.edureka.co/big-data-and-hadoop
Annie’s Introduction
Hello there!! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Slide 7 www.edureka.co/big-data-and-hadoop
Annie’s Question
Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM, etc.)
Slide 8 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans.
» XML files, e-mail body → Semi-structured data
» Audio, Video, Images, Archived documents → Unstructured data
» Data from Enterprise systems (ERP, CRM, etc.) → Structured data
Slide 9 www.edureka.co/big-data-and-hadoop
Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Slide 10 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios
Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Slide 11 www.edureka.co/big-data-and-hadoop
Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
Slide 12 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios (Contd.)
Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
Retail
» Point-of-Sale Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Slide 13 www.edureka.co/big-data-and-hadoop
Why DFS?
Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel – 100 MB/s
» 10 Machines: 4 I/O channels each, each channel – 100 MB/s
Slide 14 www.edureka.co/big-data-and-hadoop
Why DFS? (Contd.)
Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel – 100 MB/s → 43 minutes
» 10 Machines: 4 I/O channels each, each channel – 100 MB/s
Slide 15 www.edureka.co/big-data-and-hadoop
Why DFS? (Contd.)
Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel – 100 MB/s → 43 minutes
» 10 Machines: 4 I/O channels each, each channel – 100 MB/s → 4.3 minutes
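The 43-minute and 4.3-minute figures are simple throughput arithmetic: one machine with 4 channels reads at 400 MB/s, while ten machines read at 4,000 MB/s in aggregate. Below is a minimal sketch of that back-of-the-envelope calculation in plain Java, assuming 1 TB = 1024 x 1024 MB and fully parallel, saturated channels (an idealisation, not a claim about real disks).

// Back-of-the-envelope read-time estimate for the numbers on this slide.
// Assumes 1 TB = 1024 * 1024 MB and that all channels run in parallel at full speed.
public class ReadTimeEstimate {

    static double minutesToRead(double totalMb, int machines,
                                int channelsPerMachine, double mbPerSecPerChannel) {
        double aggregateMbPerSec = machines * channelsPerMachine * mbPerSecPerChannel;
        return totalMb / aggregateMbPerSec / 60.0;
    }

    public static void main(String[] args) {
        double oneTbInMb = 1024.0 * 1024.0;
        System.out.printf("1 machine  : %.1f minutes%n", minutesToRead(oneTbInMb, 1, 4, 100.0));   // ~43.7 minutes
        System.out.printf("10 machines: %.1f minutes%n", minutesToRead(oneTbInMb, 10, 4, 100.0));  // ~4.4 minutes
    }
}

This is exactly why HDFS spreads a file's blocks across many DataNodes: aggregate I/O bandwidth grows with the number of machines.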
Slide 16 www.edureka.co/big-data-and-hadoop
Slide 17 www.edureka.co/big-data-and-hadoop
Hadoop Cluster: A Typical Use Case

Active NameNode / Standby NameNode
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: Redundant power supply

Secondary NameNode
» RAM: 32 GB
» Hard disk: 1 TB
» Processor: Xeon with 4 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: Redundant power supply

DataNode (each)
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
Slide 18 www.edureka.co/big-data-and-hadoop
Hidden Treasure
Insight into data can provide a business advantage.

Some key early indicators can mean fortunes to a business.

More data enables more precise analysis.

*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process customer activity and sales data.

Case Study: Sears Holdings Corporation
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Slide 19 www.edureka.co/big-data-and-hadoop
Limitations of Existing Data Analytics Architecture

[Diagram: Instrumentation → Collection → storage-only grid (original raw data) → ETL compute grid → RDBMS (aggregated data, mostly append) → BI reports + interactive apps]

Storage: a meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived.

1. Can't explore original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
Slide 20 www.edureka.co/big-data-and-hadoop
Solution: A Combined Storage + Compute Layer

[Diagram: Instrumentation → Collection → Hadoop: storage + compute grid (both storage and processing, mostly append) → RDBMS (aggregated data) → BI reports + interactive apps]

The entire ~2 PB of data is available for processing; no data archiving.

1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
Slide 21 www.edureka.co/big-data-and-hadoop
Annie’s Question
Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets
Slide 22 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. Large Data Sets. Hadoop is also capable of processing small data sets; however, to experience its true power, one needs data in the terabyte range. This is where an RDBMS takes hours or fails outright, whereas Hadoop does the same work in a couple of minutes.
Slide 23 www.edureka.co/big-data-and-hadoop
Hadoop Ecosystem (Hadoop 2.0)

» Flume – ingestion of unstructured or semi-structured data
» Sqoop – ingestion of structured data
» HDFS (Hadoop Distributed File System)
» YARN – Cluster Resource Management
» MapReduce Framework
» HBase
» Other YARN Frameworks (MPI, GRAPH)
» Pig Latin – Data Analysis
» Hive – DW System
» Mahout – Machine Learning
» Apache Oozie (Workflow)
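The ecosystem above places the MapReduce framework on top of YARN and HDFS. As a quick taste of what a MapReduce program looks like, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names are illustrative, not taken from the slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Classic word count: the mapper emits (word, 1) for every token,
// the reducer sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

A driver class would normally configure a Job with these classes and the HDFS input/output paths before submitting it; YARN then schedules the map and reduce tasks across the cluster.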
Slide 24 www.edureka.co/big-data-and-hadoop
YARN – Moving beyond MapReduce

» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave, …)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Slide 25 www.edureka.co/big-data-and-hadoop
Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

» Standalone (or Local) Mode: No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
» Pseudo-Distributed Mode: All the Hadoop daemons run on the local machine.
» Fully-Distributed Mode: The Hadoop daemons run on a cluster of machines.
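In practice, the mode a client program sees is determined by its Hadoop configuration, chiefly the fs.defaultFS property. Below is a minimal sketch using the standard org.apache.hadoop.conf.Configuration and FileSystem APIs; the hdfs://localhost:9000 address is a hypothetical pseudo-distributed NameNode, not a value taken from the slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModeProbe {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Standalone (local) mode: no daemons and no DFS, so the client
        // works directly against the local file system.
        conf.set("fs.defaultFS", "file:///");

        // Pseudo-distributed or fully-distributed mode: point the client
        // at the NameNode instead, e.g. (hypothetical address):
        // conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Working against: " + fs.getUri());
        System.out.println("Home directory : " + fs.getHomeDirectory());
        fs.close();
    }
}

The same program runs unchanged in all three modes; only the configuration, and therefore the file system it talks to, differs.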
Slide 26 www.edureka.co/big-data-and-hadoop
Learning Path to Certification
Course: LIVE Online Class, Class Recording in LMS
24/7 Post-Class Support, Module-Wise Quiz and Assignment
Project Work
Verifiable Certificate
1. Assistance from Peers and Support team
2. Review for Certification
Slide 27 www.edureka.co/big-data-and-hadoop
CAP Theorem

Here is a brief description of the three combinations CA, CP and AP:

» CA – Single-site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
» CP – Some data may not be accessible, but the rest is still consistent/accurate.
» AP – The system is still available under partitioning, but some of the data returned may be inaccurate.
Slide 28 www.edureka.co/big-data-and-hadoop
Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Slide 29 www.edureka.co/big-data-and-hadoop
Assignment
Referring to the documents present in the LMS under Assignment, solve the problem below.

How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
Slide 30
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!

Please spare a few minutes to take the survey after the webinar.
www.edureka.co/big-data-and-hadoop
Survey