mindsmapped-consulting
Introduction to Big Data & Hadoop
Big Data Hadoop Training
Introduction to Big Data
Page 3Classification: Restricted
Importance Of Data
“Data is the new oil,” said Andreas Weigend, social data guru and former chief scientist at Amazon.com. “Oil needs to be refined before it can be useful.”
ESG Report on Analytics:
• Majority of organizations view data analytics as a top 5 business and IT priority.
• Reduced costs and process improvement are top data analytics platform benefits.
• No leading data analytics platform has emerged yet. Nearly one-third of the organizations surveyed are using a custom-developed solution.
• Big data is driving changes in analytics tools, infrastructure, and processes.
Meaning of the term Big Data
Size of the largest dataset for processing
Number of Data Sources to integrate
Update frequency of the largest data set
Challenges while processing data
Key benefits from processing data
Big Data & its hype..
• Gartner: Hadoop will be in two-thirds of advanced analytics products by 2015.
• Livemint.com: SMAC (social, mobile, analytics, cloud) is the new flavour of IT companies.
• SMAC will allow the IT industry to offer more value to clients.
• Offshore Insights: Growth of IT companies will be dictated by cloud, mobile, analytics, big data and social media services, according to a survey of 410 global IT decision-makers by research firm Offshore Insights, released in February.
What is Big Data?
• Lots of data (on the order of terabytes or petabytes).
• The term applies to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
• Systems and enterprises generate huge amounts of data, from terabytes to even petabytes.
Structured Vs Unstructured
Big Data Characteristics
• Big Data is characterized by the 3 Vs: volume, velocity, and variety.
Time for Quiz
• For each of the following file formats, identify which category of data it belongs to:
• Word docs, PDFs, text files
• Email body
• XML files
• Data generated by ERPs, CRMs, etc.
Big Data Users & Scenarios
Challenges Of Big Data
• Problem #1 : Slow Disk Reads/Writes
• Problem #2 : Hardware Failures
• Problem #3 : Data integration & Transfer
Why Distributed Processing?
To read 1 TB of data (1 TB = 1,048,576 MB) at a disk transfer rate of 100 MB/sec:
• With 1 disk: 1 TB / (100 MB/sec) ≈ 10,485 sec, or about 175 min.
• With 5 disks read in parallel: 1 TB / (5 × 100 MB/sec) ≈ 2,097 sec, or about 35 min.
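The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `read_time_seconds` and its parameters are illustrative, and it assumes binary units (1 TB = 1024 × 1024 MB), which is what the slide's 10485-second figure implies.

```python
MB_PER_TB = 1024 * 1024  # binary units, matching the slide's figures


def read_time_seconds(size_tb: float, disk_mb_per_sec: float, num_disks: int = 1) -> float:
    """Seconds needed to read size_tb terabytes when num_disks disks are read in parallel."""
    total_mb = size_tb * MB_PER_TB
    return total_mb / (disk_mb_per_sec * num_disks)


single = read_time_seconds(1, 100)                 # one disk at 100 MB/sec
parallel = read_time_seconds(1, 100, num_disks=5)  # five such disks in parallel

print(f"1 disk:  {single:.0f} sec ({single / 60:.0f} min)")
print(f"5 disks: {parallel:.0f} sec ({parallel / 60:.0f} min)")
```

Read time falls in proportion to the number of disks, which is the motivation for spreading data across many machines.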
Introduction to Hadoop
Course Contents:
• History of Hadoop
• Hadoop Ecosystem
• Hadoop Animal Planet
• What is Hadoop?
• Distinctions of Hadoop
• Hadoop Components
• Anatomy of a File Write
• Anatomy of a File Read
• Replication & Rack Awareness
History of Hadoop
Hadoop Ecosystem
Hadoop Animal Planet
What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Key Distinctions of Hadoop
Hadoop is:
• Scalable
• Robust
• Accessible
• Simple
Hadoop Components
• HDFS, the Hadoop Distributed File System (storage):
  • Data is split and distributed across nodes
  • Each split is replicated
  • The NameNode is the master and the DataNodes are the slaves
• MapReduce (processing):
  • Splits a task across processors
  • Execution happens near the data, and the results are merged
  • Self-healing
  • The JobTracker is the master and the TaskTrackers are the slaves
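The MapReduce flow described above (split the work, process each split, merge the results) can be illustrated with a word count in plain Python. This is a single-process toy and does not use the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: merge each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Two input splits stand in for data blocks stored on different nodes.
splits = ["big data needs hadoop", "hadoop stores big data"]

# On a real cluster each split would be mapped on the node that holds it.
mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

On a cluster, the map calls run in parallel near the data and only the grouped intermediate pairs travel over the network to the reducers.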
[Figure: Hadoop components. The MapReduce layer has one Job Tracker and three Task Trackers; the HDFS cluster has one NameNode and three Data Nodes.]
Storage Components
• NameNode:
  • The master node, responsible for the entire cluster
  • Manages the filesystem namespace
  • Enterprise-level software is used
• DataNode:
  • Slaves that run on commodity (cheap) hardware
  • Store and retrieve data when told to (by a client or the NameNode)
  • Send heartbeat signals to the NameNode, along with reports of the blocks they store
• Secondary NameNode:
  • Periodically checkpoints the NameNode's metadata by merging the edit log into the filesystem image; despite the name, it is not a hot standby for the NameNode
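The heartbeat bookkeeping described above can be sketched as follows. This is a toy model, not Hadoop code: the class `NameNodeSketch`, its methods, and the 30-second timeout are all hypothetical, and it follows the slide in bundling the block list with the heartbeat, whereas real HDFS sends heartbeats and block reports as separate messages.

```python
import time


class NameNodeSketch:
    """Toy stand-in for the NameNode's view of DataNodes and blocks (hypothetical)."""

    def __init__(self, heartbeat_timeout_sec=30.0):  # timeout value is an assumption
        self.heartbeat_timeout_sec = heartbeat_timeout_sec
        self.last_seen = {}        # datanode id -> time of last heartbeat
        self.block_locations = {}  # block id -> set of datanode ids

    def receive_heartbeat(self, datanode_id, block_ids, now=None):
        """Record that a DataNode is alive and which blocks it stores."""
        now = time.time() if now is None else now
        self.last_seen[datanode_id] = now
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def live_datanodes(self, now=None):
        """DataNodes whose last heartbeat arrived within the timeout."""
        now = time.time() if now is None else now
        return {dn for dn, seen in self.last_seen.items()
                if now - seen <= self.heartbeat_timeout_sec}

    def locate(self, block_id, now=None):
        """Live DataNodes that hold the given block."""
        return self.block_locations.get(block_id, set()) & self.live_datanodes(now)


nn = NameNodeSketch(heartbeat_timeout_sec=30.0)
nn.receive_heartbeat("dn1", ["A", "B"], now=0.0)
nn.receive_heartbeat("dn2", ["A"], now=0.0)
nn.receive_heartbeat("dn2", ["A"], now=40.0)  # dn1 has gone silent by now
print(nn.locate("A", now=50.0))
```

A DataNode that stops sending heartbeats drops out of the live set, so a client asking for a block is only pointed at nodes believed to be reachable.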
Processing Components
• JobTracker:
  • Coordinates all the jobs run on the system by scheduling tasks
  • Keeps a record of the overall progress of each job
  • If a task fails, reschedules it on a different TaskTracker
• TaskTracker:
  • Slave daemon that accepts tasks to be run on a block of data
  • Sends progress reports as heartbeat signals to the JobTracker at regular intervals
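The rescheduling behaviour described above can be sketched in a few lines. This is a simplified illustration, not the actual Hadoop scheduler: the class `JobTrackerSketch`, its methods, and the tracker names are hypothetical, and it ignores data locality and retry limits.

```python
class JobTrackerSketch:
    """Toy scheduler that never retries a task on a tracker it already failed on."""

    def __init__(self, tasktrackers):
        self.tasktrackers = list(tasktrackers)
        self.assignments = {}  # task id -> tracker currently running it
        self.attempted = {}    # task id -> trackers already tried

    def schedule(self, task_id):
        """Assign the task to the first tracker that has not yet attempted it."""
        tried = self.attempted.setdefault(task_id, set())
        for tracker in self.tasktrackers:
            if tracker not in tried:
                tried.add(tracker)
                self.assignments[task_id] = tracker
                return tracker
        raise RuntimeError(f"no untried TaskTracker left for {task_id}")

    def report_failure(self, task_id):
        """Called when a task attempt fails: reschedule it on a different tracker."""
        return self.schedule(task_id)


jt = JobTrackerSketch(["tt1", "tt2", "tt3"])
first = jt.schedule("map-0")
second = jt.report_failure("map-0")
print(first, second)
```

Tracking attempts per task is what lets the scheduler pick a different node after a failure instead of retrying on the same one.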
HDFS
MapReduce Job
Anatomy of a File Read
Anatomy of a File Write
Replication & Rack awareness
[Figure: Blocks A, B, and C replicated across three racks. Rack 1 holds nodes 1–4, Rack 2 holds nodes 5–8, and Rack 3 holds nodes 9–12.]
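Using the rack layout above (three racks of four nodes each), rack-aware placement can be sketched as follows. The extracted slide does not preserve which nodes hold which block, so this sketch instead follows HDFS's default policy for a replication factor of 3: the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that same second rack. The names `RACKS`, `rack_of`, and `place_replicas` are hypothetical.

```python
import random

# Rack layout taken from the slide: three racks, four nodes each.
RACKS = {
    "rack1": [1, 2, 3, 4],
    "rack2": [5, 6, 7, 8],
    "rack3": [9, 10, 11, 12],
}


def rack_of(node):
    """Return the name of the rack that contains the given node."""
    for rack, nodes in RACKS.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node {node}")


def place_replicas(writer_node, rng=random):
    """Pick three nodes for one block: local node, remote rack, same remote rack."""
    first = writer_node
    remote_rack = rng.choice([r for r in RACKS if r != rack_of(first)])
    second = rng.choice(RACKS[remote_rack])
    third = rng.choice([n for n in RACKS[remote_rack] if n != second])
    return [first, second, third]


for block in "ABC":
    print(block, place_replicas(writer_node=1))
```

Keeping two replicas on one remote rack survives the loss of a whole rack while limiting cross-rack write traffic to a single transfer.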