A high level introduction to Hadoop and its layer cake. Presented at Dubai's Big Data in Finance on Apr 2nd 2014
INTRODUCTION TO THE HADOOP ECOSYSTEM
BAKING A LAYER CAKE AND BEYOND…
“Qu’ils mangent de la brioche.” (‘Let them eat cake.’)
BEFORE WE BEGIN
Questions for the audience… How many of you have:
Been working with Hadoop for more than 3 months?
Been working with Hadoop for more than 6 months?
Been working with Hadoop for more than 1 year?
How many of you have heard about this thing called ‘Hadoop’ / ‘Big Data’ and thought it would be fun to check it out?
About the Speaker
BSCIS - The College of Engineering, The Ohio State University
‘Big Data’ Consultant with > 25 years in IT
Working solely in the ‘Big Data’ space since 2009
Founded Chicago area Hadoop User Group (CHUG) in April 2010
1600+ Members
Over 200 different companies across all industries in the Chicagoland area.
Has spoken at conferences around the US on Hadoop.
Guest lecturer at the Illinois Institute of Technology.
Co-authored papers found on InfoQ.
MapR Admin, Cloudera Admin & Developer certified.
email: MSegel (at) segel.com
Skype: Michael_Segel
What is Hadoop?
‘A framework of software tools that allows one to take a large problem and process the individual pieces in parallel.’
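The definition above — split a large problem into pieces and process them in parallel — is the MapReduce idea at the heart of Hadoop. A minimal sketch in plain Python (no Hadoop involved; on a real cluster each map step would run on a different node):

```python
from collections import Counter

def map_phase(chunk):
    """Map: emit per-word counts for one input split."""
    return Counter(chunk.split())

def reduce_phase(partials):
    """Reduce: merge the independent partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def word_count(text, splits=4):
    """Split the input into pieces, 'map' each piece independently,
    then 'reduce' the partial results. The map step is embarrassingly
    parallel -- that is the whole point of the framework."""
    words = text.split()
    step = max(1, len(words) // splits)
    chunks = [" ".join(words[i:i + step]) for i in range(0, len(words), step)]
    partials = [map_phase(c) for c in chunks]
    return reduce_phase(partials)
```

The canonical Hadoop example is exactly this word count, written in Java against the MapReduce API; the structure is the same.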
Our Hadoop Layer Cake: Circa 2010
Storage
Job Control
Data Access
Programming Languages
Our Hadoop Layer Cake: Circa 2013 (Hadoop 2.0)
Storage
Job Control
Resource Control
Real Time
Messages
Data Frameworks
Confused? This is just the tip of the iceberg.
The only constant is change…
Hadoop is a disruptive technology, forcing the enterprise to rethink how it handles data.
The core Apache Framework is just the starting point.
Disruption allows new vendors to compete with established vendors.
If you can build a better mousetrap, you will attract customers.
Hadoop plays nice with others…
Myth: PROPRIETARY SOFTWARE IS BAD.
Reality: VENDOR LOCK-IN IS BAD.
Myth: HADOOP IS ONLY GOOD FOR BATCH PROCESSING.
Reality: HADOOP CAN ALSO BE USED FOR ‘REAL TIME’ PROBLEMS.
[CENSORED]
REAL TIME HADOOP: SINGLE DATA CENTER SOLUTION
Nightly batch jobs create the next day’s advertising lists.
Client phone connects to the web service.
Web service talks to the Ad Engine.
Phone connects to the Ad Engine to get an ad.
Ad Engine connects to HBase to get the list of potential ads to display, sending the correct ad to the phone.
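The pattern in this case study — a slow batch layer precomputing results that a low-latency serving layer reads per request — can be sketched as follows. This is an illustrative sketch only: a plain dict stands in for the HBase table, and all names (`nightly_batch`, `serve_ad`, the ad-list format) are hypothetical, not the client’s actual system.

```python
def nightly_batch(click_history):
    """Batch layer: precompute tomorrow's candidate-ad list per user.
    In the real system this is the nightly Map/Reduce job writing to HBase."""
    table = {}
    for user, categories in click_history.items():
        table[user] = ["ad_for_" + c for c in sorted(categories)]
    return table

def serve_ad(table, user, default_ad="ad_generic"):
    """Serving layer: a single low-latency key lookup at request time,
    returning the first candidate (real logic would rank or rotate)."""
    candidates = table.get(user)
    return candidates[0] if candidates else default_ad
```

The ‘real time’ part is only the lookup; all the heavy computation happened in batch the night before.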
Myth: HADOOP IS A STANDALONE SYSTEM AND WILL REPLACE TRADITIONAL VENDORS’ PRODUCTS.
Reality: HADOOP IS PART OF THE ENTERPRISE. IT CAN BE STANDALONE, OR IT CAN WORK WITH EXISTING INFRASTRUCTURE.
HADOOP AND THE ENTERPRISE: WE CAN ALL GET ALONG…
Hadoop communicates well with the rest of the Enterprise…
Central cluster feeds distributed web services with local database backing.
HADOOP AND THE ENTERPRISE: WE CAN ALL GET ALONG…
Hadoop communicates well with the rest of the Enterprise…
Traditional data stores play nice with Hadoop, with some seeing HDFS files as external tables.
How Traditional Vendors View Hadoop
In the beginning, they saw Hadoop as a threat to be crushed.
Then: if you can’t beat them, join them…
Oracle Partners with Cloudera
EMC partnered with MapR, then released its own distribution. (Green Stack)
Teradata partners with Hortonworks.
Microsoft partnered with Hortonworks.
Intel
Tried to create their own distro.
Last week, dumped their distro and made a large investment into Cloudera.
IBM has its own distro, yet certifies its tools to run on Cloudera.
Cisco partners with MapR
Amazon (AWS) has own distro, Partners with MapR.
Myth: HADOOP CLUSTERS SHOULD BE BUILT ON COMMODITY HARDWARE.
Reality: YOU CAN DESIGN YOUR CLUSTER AROUND YOUR CONSTRAINTS…
PROJECT
DATE
CLIENT
ALTERNATIVE CLUSTER LAYOUT: STORAGE / COMPUTE CLUSTER
A higher density of disk and compute per cluster.
Premium over commodity hardware.
I/O latency.
Could be part of a virtualization solution.
Myth: HADOOP IS OPEN SOURCE AND THEREFORE FREE.
Reality: T.A.N.S.T.A.A.F.L. (‘TANS-TAH-FELL’: THERE AIN’T NO SUCH THING AS A FREE LUNCH)
There ain’t no such thing as a free lunch…
Customers are paying for support.
Tools are primitive and require work; there is no real point-and-click solution in place, but it’s getting there.
Hadoop fills the gap where you want a custom solution. Merging semi-structured and structured data is going to be data dependent, requiring customization.
Beyond ETL and SQL, custom apps require developer expertise. (You must invest in skills.)
Depending on the use case, Time to Value (TtV) will differ.
Bottom line: there is a cost reduction over traditional solutions, but it’s not free.
Take away…
Hadoop is a tool set that is constantly evolving.
Beware of marketing myths…
Do your own homework and talk to the vendors.
Make them earn your business.
T.A.N.S.T.A.A.F.L. applies; you need to make an investment in terms of skills.
Hadoop isn’t a separate solution; it should be part of your overall Enterprise strategy.
Hadoop isn’t a silver bullet. By itself, it doesn’t solve your business problems.
YOU CAN HAVE YOUR CAKE AND EAT IT TOO!
QUESTIONS?
Thank You For Your Time
What is a layer cake?
layer cake, noun [C], US:
two or more soft cakes put on top of each other with jam, cream, icing, etc. (= a sweet mixture made from sugar) between the cakes and covering the top and sides.
Also: a term for a diagram showing how various parts of a group of components tie together in terms of a functional stack.
What is Hadoop? The Storage Layer
The Storage Layer is a distributed file system that accomplishes the following:
Uniform access from any machine in the cluster.
Fast access.
Resiliency (self-healing).
Redundancy (replication).
This is known as HDFS - the Hadoop Distributed File System.
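The redundancy point above comes from block replication: HDFS splits each file into fixed-size blocks and stores several copies of each block on different nodes, so losing one machine loses no data. A toy sketch of the idea (illustrative only — real HDFS placement is rack-aware and far more involved):

```python
def place_blocks(file_bytes, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes, spread round-robin across the cluster.
    Returns {block_index: [node, node, node]}."""
    n_blocks = (len(file_bytes) + block_size - 1) // block_size  # ceiling div
    placement = {}
    for b in range(n_blocks):
        # Each replica of block b lands on a different node.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement
```

With three replicas per block, any single node can fail and every block still has two live copies — that is the self-healing starting point: HDFS notices under-replicated blocks and re-copies them.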
What is Hadoop? The Job Control Layer
The Job Control Layer accomplishes the following:
Manages and schedules jobs to be run (default FIFO, Capacity Scheduler, …).
Manages the overall job and distributes the subprocesses across the cluster.
Manages the subprocesses being run on each node in the cluster.
This is accomplished by a Job Tracker (cluster level) and Task Trackers (node level).
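The Job Tracker / Task Tracker split described above can be sketched as a queue of jobs plus heartbeating workers. This is a toy model of the default FIFO behaviour, not Hadoop’s actual classes: the tracker holds jobs in arrival order, breaks each into tasks, and hands a task to whichever node reports a free slot.

```python
from collections import deque

class ToyJobTracker:
    """Cluster-level coordinator: queues jobs FIFO, hands out their tasks."""

    def __init__(self):
        self.jobs = deque()        # each entry: list of pending task names
        self.assignments = []      # (task, node) pairs, for inspection

    def submit(self, job_name, n_tasks):
        """A client submits a job, which is split into n_tasks subprocesses."""
        self.jobs.append([f"{job_name}:task{i}" for i in range(n_tasks)])

    def heartbeat(self, node):
        """A node-level Task Tracker with a free slot asks for work.
        FIFO: always serve the oldest job that still has pending tasks."""
        while self.jobs and not self.jobs[0]:
            self.jobs.popleft()    # discard jobs whose tasks are all assigned
        if not self.jobs:
            return None
        task = self.jobs[0].pop(0)
        self.assignments.append((task, node))
        return task
```

Note the FIFO consequence: a second job gets no slots until every task of the first job has been handed out — exactly the behaviour the Capacity and Fair schedulers were introduced to improve.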
What is Hadoop? The Data Access Layer
The Data Access Layer accomplishes the following:
Allows for higher-level access which can be translated into a Map/Reduce job:
Pig (Yahoo!)
Hive (Facebook)
Allows for ad hoc access to data outside of the Map/Reduce framework (HBase).
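The “higher-level access translated into a Map/Reduce job” point is the key idea behind Hive and Pig: you state *what* you want, and a planner compiles it into map and reduce functions. A hedged sketch of that translation step — this mimics the idea, not the actual Hive or Pig planners:

```python
def compile_query(filter_word):
    """Pretend 'compiler': turn a declarative request like
    'SELECT count(*) FROM logs WHERE line CONTAINS <filter_word>'
    into a map function and a reduce function."""
    def map_fn(line):
        # Map: emit 1 for every matching line, 0 otherwise.
        return 1 if filter_word in line else 0

    def reduce_fn(values):
        # Reduce: sum the emitted values into the final count.
        return sum(values)

    return map_fn, reduce_fn

def run(lines, filter_word):
    """Execute the compiled 'query' over the input."""
    map_fn, reduce_fn = compile_query(filter_word)
    return reduce_fn(map_fn(line) for line in lines)
```

HBase sits apart from this: instead of compiling a whole-dataset scan, it serves individual reads and writes by key, which is the “ad hoc access outside of Map/Reduce” item above.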
What is Hadoop? The Job Flow Control Layer
The Job Flow Control Layer builds on the Data Access Layer and accomplishes the following:
Allows for higher-level access which can be translated into a Map/Reduce job:
Pig (Yahoo!)
Hive (Facebook)
Allows for ad hoc access to data outside of the Map/Reduce framework (HBase).
Allows for processes to be chained together to create a work flow (Oozie)*
*Nowhere else to put it…
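The chaining idea behind Oozie — run step B only after step A succeeds, feeding outputs forward, and report which step broke — can be sketched in a few lines. The names here are illustrative, not Oozie’s API (real Oozie workflows are XML definitions of Map/Reduce, Pig, and Hive actions with control-flow nodes):

```python
def run_workflow(steps, initial_input):
    """Run (name, callable) steps in order. Each step receives the previous
    step's output; on the first failure, stop and report the failing step."""
    data = initial_input
    for name, step in steps:
        try:
            data = step(data)
        except Exception as exc:
            return {"status": "FAILED", "step": name, "error": str(exc)}
    return {"status": "SUCCEEDED", "result": data}
```

Usage: `run_workflow([("split", lambda d: d.split()), ("count", len)], "a b c")` chains a tokenize step into a count step, the same way an Oozie workflow chains an ingest job into an aggregation job.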
List of Apache Incubator Projects associated with Hadoop:
Storm
Accumulo
Knox
Sentry
Falcon
DataFu
Drill
Tez
Twill
Phoenix
Hadoop Dev Tools
Tajo