www.edureka.co/big-data-and-hadoop
Introduction to Big Data and Hadoop
Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this session, you will understand:
Big Data Learning Paths
Big Data Introduction
Hadoop and Its Eco-System
Hadoop Architecture
Next Steps: How to Set Up Hadoop
Slide 3 www.edureka.co/big-data-and-hadoop
Big Data Learning Path
Developer/Testing
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark

Administration
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization

Data Analyst
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R
Big Data and Hadoop
MapReduceDesign Patterns
Apache Spark & Scala
Apache Cassandra
Linux Administration Hadoop Administration
Data Science
Business Analytics Using R
Advanced Predictive Modelling in R
Talend for Big Data
Data Visualization Using Tableau
Slide 4 www.edureka.co/big-data-and-hadoop
Unstructured Data is Exploding
Source: Twitter
Slide 5 www.edureka.co/big-data-and-hadoop
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
IBM’s Definition of Big Data
Slide 6 www.edureka.co/big-data-and-hadoop
Annie’s Introduction
Hello there!! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Slide 7 www.edureka.co/big-data-and-hadoop
Annie’s Question
Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM, etc.)
Slide 8 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans.
» XML files, e-mail body → Semi-structured data
» Audio, Video, Images, Archived documents → Unstructured data
» Data from Enterprise systems (ERP, CRM, etc.) → Structured data
Slide 9 www.edureka.co/big-data-and-hadoop
Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Slide 10 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios
Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Slide 11 www.edureka.co/big-data-and-hadoop
Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
Slide 12 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios (Contd.)
Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
Retail
» Point-of-Sale Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Slide 13 www.edureka.co/big-data-and-hadoop
Why DFS?
Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel – 100 MB/s
» 10 Machines: 4 I/O channels each, each channel – 100 MB/s
Slide 14 www.edureka.co/big-data-and-hadoop
Why DFS? (Contd.)
Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel – 100 MB/s → 43 minutes
» 10 Machines: 4 I/O channels each, each channel – 100 MB/s
Slide 15 www.edureka.co/big-data-and-hadoop
Why DFS? (Contd.)
Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel – 100 MB/s → 43 minutes
» 10 Machines: 4 I/O channels each, each channel – 100 MB/s → 4.3 minutes
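The 43-minute and 4.3-minute figures are simple throughput arithmetic: one machine with 4 channels reads at 400 MB/s, while ten machines read at 4,000 MB/s in aggregate. Below is a minimal sketch of that back-of-the-envelope calculation in plain Java, assuming 1 TB = 1024 x 1024 MB and fully parallel, saturated channels (an idealisation, not a claim about real disks).

// Back-of-the-envelope read-time estimate for the numbers on this slide.
// Assumes 1 TB = 1024 * 1024 MB and that all channels run in parallel at full speed.
public class ReadTimeEstimate {

    static double minutesToRead(double totalMb, int machines,
                                int channelsPerMachine, double mbPerSecPerChannel) {
        double aggregateMbPerSec = machines * channelsPerMachine * mbPerSecPerChannel;
        return totalMb / aggregateMbPerSec / 60.0;
    }

    public static void main(String[] args) {
        double oneTbInMb = 1024.0 * 1024.0;
        System.out.printf("1 machine  : %.1f minutes%n", minutesToRead(oneTbInMb, 1, 4, 100.0));   // ~43.7 minutes
        System.out.printf("10 machines: %.1f minutes%n", minutesToRead(oneTbInMb, 10, 4, 100.0));  // ~4.4 minutes
    }
}

This is exactly why HDFS spreads a file's blocks across many DataNodes: aggregate I/O bandwidth grows with the number of machines.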
Slide 16 www.edureka.co/big-data-and-hadoop
Slide 17 www.edureka.co/big-data-and-hadoop
Hadoop Cluster: A Typical Use Case

Active NameNode / Standby NameNode
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: Redundant power supply

Secondary NameNode
» RAM: 32 GB
» Hard disk: 1 TB
» Processor: Xeon with 4 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: Redundant power supply

DataNode (each)
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
Slide 18 www.edureka.co/big-data-and-hadoop
Hidden Treasure
Insight into data can provide a business advantage.

Some key early indicators can mean fortunes to a business.

More data enables more precise analysis.

*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process customer activity and sales data.

Case Study: Sears Holdings Corporation
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Slide 19 www.edureka.co/big-data-and-hadoop
Limitations of Existing Data Analytics Architecture

[Diagram: Instrumentation → Collection → storage-only grid (original raw data) → ETL compute grid → RDBMS (aggregated data, mostly append) → BI reports + interactive apps]

Storage: a meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived.

1. Can't explore original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
Slide 20 www.edureka.co/big-data-and-hadoop
Solution: A Combined Storage + Compute Layer

[Diagram: Instrumentation → Collection → Hadoop: storage + compute grid (both storage and processing, mostly append) → RDBMS (aggregated data) → BI reports + interactive apps]

The entire ~2 PB of data is available for processing; no data archiving.

1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
Slide 21 www.edureka.co/big-data-and-hadoop
Annie’s Question
Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets
Slide 22 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. Large Data Sets. Hadoop is also capable of processing small data sets; however, to experience its true power, one needs data in the terabyte range. This is where an RDBMS takes hours or fails outright, whereas Hadoop does the same work in a couple of minutes.
Slide 23 www.edureka.co/big-data-and-hadoop
Hadoop Ecosystem (Hadoop 2.0)

» Flume – ingestion of unstructured or semi-structured data
» Sqoop – ingestion of structured data
» HDFS (Hadoop Distributed File System)
» YARN – Cluster Resource Management
» MapReduce Framework
» HBase
» Other YARN Frameworks (MPI, GRAPH)
» Pig Latin – Data Analysis
» Hive – DW System
» Mahout – Machine Learning
» Apache Oozie (Workflow)
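The ecosystem above places the MapReduce framework on top of YARN and HDFS. As a quick taste of what a MapReduce program looks like, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names are illustrative, not taken from the slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Classic word count: the mapper emits (word, 1) for every token,
// the reducer sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

A driver class would normally configure a Job with these classes and the HDFS input/output paths before submitting it; YARN then schedules the map and reduce tasks across the cluster.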
Slide 24 www.edureka.co/big-data-and-hadoop
YARN – Moving beyond MapReduce

» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave, …)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Slide 25 www.edureka.co/big-data-and-hadoop
Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

» Standalone (or Local) Mode: No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
» Pseudo-Distributed Mode: All the Hadoop daemons run on the local machine.
» Fully-Distributed Mode: The Hadoop daemons run on a cluster of machines.
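In practice, the mode a client program sees is determined by its Hadoop configuration, chiefly the fs.defaultFS property. Below is a minimal sketch using the standard org.apache.hadoop.conf.Configuration and FileSystem APIs; the hdfs://localhost:9000 address is a hypothetical pseudo-distributed NameNode, not a value taken from the slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModeProbe {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Standalone (local) mode: no daemons and no DFS, so the client
        // works directly against the local file system.
        conf.set("fs.defaultFS", "file:///");

        // Pseudo-distributed or fully-distributed mode: point the client
        // at the NameNode instead, e.g. (hypothetical address):
        // conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Working against: " + fs.getUri());
        System.out.println("Home directory : " + fs.getHomeDirectory());
        fs.close();
    }
}

The same program runs unchanged in all three modes; only the configuration, and therefore the file system it talks to, differs.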
Slide 26 www.edureka.co/big-data-and-hadoop
Learning Path to Certification
Course: LIVE Online Class, Class Recording in LMS
24/7 Post-Class Support, Module-Wise Quiz and Assignment
Project Work
Verifiable Certificate
1. Assistance from Peers and Support team
2. Review for Certification
Slide 27 www.edureka.co/big-data-and-hadoop
CAP Theorem

Here is a brief description of the three combinations CA, CP and AP:

» CA – Single-site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
» CP – Some data may not be accessible, but the rest is still consistent/accurate.
» AP – The system is still available under partitioning, but some of the data returned may be inaccurate.
Slide 28 www.edureka.co/big-data-and-hadoop
Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Slide 29 www.edureka.co/big-data-and-hadoop
Assignment
Referring to the documents present in the LMS under Assignment, solve the problem below.

How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
Slide 30
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!

Please spare a few minutes to take the survey after the webinar.
www.edureka.co/big-data-and-hadoop
Survey