INTRODUCTION TO THE HADOOP ECOSYSTEM: BAKING A LAYER CAKE AND BEYOND…

Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel


DESCRIPTION

A high-level introduction to Hadoop and its layer cake, presented at Dubai's Big Data in Finance on April 2, 2014.


Page 1: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

INTRODUCTION TO THE HADOOP ECOSYSTEM

BAKING A LAYER CAKE AND BEYOND…

“Qu’ils mangent de la brioche.”


Page 2: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

BEFORE WE BEGIN

Questions for the audience…

How many of you have:

Been working with Hadoop for more than 3 months?

Been working with Hadoop for more than 6 months?

Been working with Hadoop for more than 1 year?

How many of you have heard about this thing called ‘Hadoop’ / ‘Big Data’ and thought it would be fun to check it out?

Page 3: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

About the Speaker

BSCIS - The College of Engineering, The Ohio State University

‘Big Data’ Consultant with > 25 years in IT

Working solely in the ‘Big Data’ space since 2009

Founded Chicago area Hadoop User Group (CHUG) in April 2010

1600+ Members

Over 200 different companies across all industries in the Chicagoland area.

Has routinely spoken about Hadoop at conferences around the US.

Guest lecturer at Illinois Institute of Technology.

Co-authored papers published on InfoQ.

MapR Admin, Cloudera Admin & Developer Certified.

email: MSegel (at) segel.com

Skype: Michael_Segel

Page 4: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

What is Hadoop?

‘A framework of software tools that allows one to take a large problem and process its individual pieces in parallel.’
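The definition above can be made concrete with a minimal sketch. It is not Hadoop code: a local thread pool stands in for the cluster, and the function names, piece size, and worker count are assumptions chosen for illustration. It only shows the shape of the idea: split a large input, process each piece independently, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_pieces(data, piece_size):
    """Cut one large input into independent pieces of at most piece_size items."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def process_piece(piece):
    """Work on a single piece in isolation (here: a partial sum)."""
    return sum(piece)


def solve_in_parallel(data, piece_size=4, workers=3):
    """Split the input, process the pieces concurrently, combine the results."""
    pieces = split_into_pieces(data, piece_size)
    # Each piece goes to a worker; no worker needs another worker's data.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_piece, pieces))
    # Combine the partial results into the final answer.
    return sum(partial_results)


total = solve_in_parallel(list(range(1, 101)))
```

The answer does not depend on how many workers run or in which order the pieces finish, which is what lets a framework spread the pieces across many machines.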


Page 5: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

Our Hadoop Layer Cake: Circa 2010

Storage

Job Control

Data Access

Programming Languages

Page 6: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

Our Hadoop Layer Cake: Circa 2013, Hadoop 2.0

Data Access

Storage

Job Control

Resource Control

Real Time

Messages

Data Frameworks

Confused? This is just the tip of the iceberg.

Page 7: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

The only constant is change…

Hadoop is a disruptive technology, forcing the enterprise to rethink how it handles data.

The core Apache Framework is just the starting point.

Disruption allows new vendors to compete with established vendors.

If you can build a better mousetrap, you will attract customers.

Hadoop plays nice with others…

Page 8: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

Myth: PROPRIETARY SOFTWARE IS BAD.

Reality: VENDOR LOCK-IN IS BAD.

“Qu’ils mangent de la brioche.”

‘Let them eat cake’

Page 9: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

Myth: HADOOP IS ONLY GOOD FOR BATCH PROCESSING.

Reality: HADOOP CAN ALSO BE USED FOR ‘REAL TIME’ PROBLEMS.

“Qu’ils mangent de la brioche.”

‘Let them eat cake’

Page 10: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

[CENSORED]


REAL TIME HADOOP

SINGLE DATA CENTER SOLUTION

Nightly batch jobs create the next day's advertising lists.

Client phone connects to the web service.

Web service talks to the Ad Engine.

Phone connects to the Ad Engine to get an ad.

Ad Engine connects to HBase to get the list of potential ads to display, sending the correct ad to the phone.
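The split between the nightly batch job and the real-time lookup can be sketched as follows. This is an illustration only: a plain Python dictionary stands in for the HBase table, and the function names, the event format, and the ranking rule (most frequent category first) are all invented for the example rather than taken from the client project.

```python
# Stand-in for an HBase table: row key (user id) -> ordered list of candidate ads.
ad_table = {}


def nightly_batch(impressions):
    """Batch step: build the next day's ad list per user from raw events.

    impressions is a list of (user_id, category) pairs. The ranking rule
    is hypothetical: most frequent category first, ties broken by name.
    """
    counts = {}
    for user_id, category in impressions:
        per_user = counts.setdefault(user_id, {})
        per_user[category] = per_user.get(category, 0) + 1
    for user_id, per_user in counts.items():
        ranked = sorted(per_user, key=lambda c: (-per_user[c], c))
        ad_table[user_id] = ["ad-" + c for c in ranked]


def ad_engine_lookup(user_id, default_ad="ad-generic"):
    """Real-time step: a single keyed read, with no batch job in the request path."""
    candidates = ad_table.get(user_id, [])
    return candidates[0] if candidates else default_ad


nightly_batch([("u1", "cars"), ("u1", "travel"), ("u1", "cars"), ("u2", "travel")])
```

The expensive work happens once per night; the request from the phone only costs one lookup by key, which is the kind of access a store like HBase is used for.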

Page 11: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

Myth: HADOOP IS A STANDALONE SYSTEM AND WILL REPLACE TRADITIONAL VENDORS’ PRODUCTS.

Reality: HADOOP IS PART OF THE ENTERPRISE. IT CAN BE STANDALONE, OR IT CAN WORK WITH EXISTING INFRASTRUCTURE.

“Qu’ils mangent de la brioche.”

‘Let them eat cake’

Page 12: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel


HADOOP AND THE ENTERPRISE

WE CAN ALL GET ALONG…

Hadoop communicates well with the rest of the Enterprise…

Central cluster feeds distributed web services with local database backing…

[split in to two slides]

Page 13: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel


HADOOP AND THE ENTERPRISE

WE CAN ALL GET ALONG…

Hadoop communicates well with the rest of the Enterprise…

Traditional data stores play nice with Hadoop, with some seeing HDFS files as external tables.

[split in to two slides]

Page 14: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

How Traditional Vendors View Hadoop

In the beginning they saw Hadoop as a threat.

They would crush them.

If you can’t beat them, join them….

Oracle partners with Cloudera.

EMC partnered with MapR, then released its own distribution. (Green Stack)

Teradata partners with Hortonworks.

Microsoft partnered with Hortonworks.

Intel

Tried to create its own distro.

Last week, dumped its distro and made a large investment in Cloudera.

IBM has its own distro, yet certifies its tools to run on Cloudera.

Cisco partners with MapR.

Amazon (AWS) has its own distro and partners with MapR.

Page 15: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

Myth: HADOOP CLUSTERS SHOULD BE BUILT ON COMMODITY HARDWARE.

Reality: YOU CAN DESIGN YOUR CLUSTER AROUND CONSTRAINTS…

“Qu’ils mangent de la brioche.”

‘Let them eat cake’

Page 16: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel


ALTERNATIVE CLUSTER LAYOUT

STORAGE / COMPUTE CLUSTER

A Higher Density of Disk and Compute Cluster

Premium over Commodity Hardware

I/O Latency

Could be part of a virtualization solution.

Page 17: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

Myth: HADOOP IS OPEN SOURCE AND THEREFORE FREE.

Reality: T.A.N.S.T.A.A.F.L. (‘TANS - TAH - FELL’): THERE AIN’T NO SUCH THING AS A FREE LUNCH.

“Qu’ils mangent de la brioche.”

‘Let them eat cake’

Page 18: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

There ain’t no such thing as a free lunch…

Customers are paying for support.

Tools are primitive and require work. There is no real point-and-click solution in place yet, but it is getting there.

Hadoop fills the gap where you want a custom solution. Merging semi-structured and structured data is going to be data dependent, requiring customization.

Beyond ETL and SQL, custom apps require developer expertise. (You must invest in skills.)

Depending on the use case, Time to Value (TtV) will differ.

Bottom line: there is a cost reduction over traditional solutions, but it’s not free.

Page 19: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

Take away…

Hadoop is a tool set that is constantly evolving.

Beware of marketing myths…

Do your own homework and talk to the vendors.

Make them earn your business.

T.A.N.S.T.A.A.F.L. applies: you need to make an investment in terms of skills.

Hadoop isn’t a separate solution and should be part of your overall Enterprise strategy.

Hadoop isn’t a silver bullet. By itself, it doesn’t solve your business problems.

Page 20: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

YOU CAN HAVE YOUR CAKE AND EAT IT TOO!

Page 21: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

QUESTIONS?

Thank You For Your Time

Page 22: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

What is a layer cake?

layer cake

noun [C] US

: two or more soft cakes put on top of each other with jam, cream, icing (= a sweet mixture made from sugar), etc. between the cakes and covering the top and sides

: a term for a diagram showing how various parts of a group of components tie together in terms of a functional stack.

Page 23: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

What is Hadoop? Storage Layer

The Storage Layer is a Distributed File System that accomplishes the following:

Uniform Access from any machine in the cluster.

Fast Access (

Resiliency (Self Healing)

Redundancy (Replication)

This is known as HDFS, the Hadoop Distributed File System.
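Redundancy and self-healing can be illustrated with a small sketch. It is a toy model, not HDFS's actual placement policy: the replication factor of 3, the round-robin placement, and the function names are assumptions made for the example, and it assumes there are at least as many nodes as replicas.

```python
REPLICATION = 3  # assumed replication factor for this sketch


def place_blocks(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {nodes[(i + k) % len(nodes)] for k in range(replication)}
    return placement


def heal(placement, live_nodes, replication=REPLICATION):
    """After a node failure, re-replicate every block that lost a copy."""
    for holders in placement.values():
        # Forget replicas that lived on nodes which are no longer alive.
        holders.intersection_update(live_nodes)
        for node in live_nodes:
            if len(holders) >= replication:
                break
            # "Copy" the block to another live node.
            holders.add(node)
    return placement


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(["b0", "b1"], nodes)
# Simulate losing n2, then let the cluster repair itself.
healed = heal(placement, ["n1", "n3", "n4"])
```

Because every block already exists on more than one node, losing a machine loses no data; the repair step only restores the safety margin.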

Page 24: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

What is Hadoop? Job Control Layer

The Job Control Layer is the layer that accomplishes the following:

Manages and Schedules Jobs to be run. (Default [FIFO], Capacity Scheduler,

Manages the overall job, and distributes the subprocesses across the cluster.

Manages the subprocesses being run on each node in the cluster.

This is accomplished by a JobTracker (cluster level) and TaskTrackers (node level).
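The division of labour between the two levels can be sketched roughly as below. These classes are simplified stand-ins, not the Hadoop classes of the same name: the FIFO queue mirrors the default scheduler mentioned above, while the round-robin task assignment and the method names are assumptions for illustration.

```python
from collections import deque


class TaskTracker:
    """Node level: runs the tasks it is given and records what it completed."""

    def __init__(self, node):
        self.node = node
        self.completed = []

    def run(self, task):
        result = task()
        self.completed.append(result)
        return result


class JobTracker:
    """Cluster level: queues jobs FIFO and spreads each job's tasks over nodes."""

    def __init__(self, task_trackers):
        self.task_trackers = task_trackers
        self.queue = deque()

    def submit(self, job_name, tasks):
        self.queue.append((job_name, tasks))

    def run_all(self):
        results = {}
        while self.queue:
            job_name, tasks = self.queue.popleft()  # oldest job first (FIFO)
            outputs = []
            for i, task in enumerate(tasks):
                # Round-robin assignment; a real scheduler also weighs data locality.
                tracker = self.task_trackers[i % len(self.task_trackers)]
                outputs.append(tracker.run(task))
            results[job_name] = outputs
        return results


trackers = [TaskTracker("node1"), TaskTracker("node2")]
job_tracker = JobTracker(trackers)
job_tracker.submit("job-a", [lambda: 1, lambda: 2, lambda: 3])
job_tracker.submit("job-b", [lambda: 10])
results = job_tracker.run_all()
```

With FIFO scheduling, job-b does not start until every task of job-a has finished, which is one reason alternative schedulers such as the Capacity Scheduler exist.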

Page 25: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

What is Hadoop? Data Access Layer

The Data Access Layer is the layer that accomplishes the following:

Allows for a higher level access which can be translated to a Map/Reduce Job

Pig (Yahoo!)

Hive (Facebook)

Allows for ad hoc access to data outside of the Map/Reduce Framework (HBase)
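To see what "translated to a Map/Reduce Job" means, consider a query of the form SELECT desk, COUNT(*) FROM trades GROUP BY desk. The sketch below shows the map, shuffle, and reduce steps such a query boils down to. It is a conceptual illustration in plain Python, not the plan Hive or Pig would actually generate, and the table name, field name, and sample rows are hypothetical.

```python
from collections import defaultdict


def map_phase(records, key_field):
    """Emit a (key, 1) pair for every input record."""
    for record in records:
        yield record[key_field], 1


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_group_by(records, key_field):
    """Equivalent of SELECT key_field, COUNT(*) ... GROUP BY key_field."""
    return reduce_phase(shuffle(map_phase(records, key_field)))


trades = [{"desk": "fx"}, {"desk": "rates"}, {"desk": "fx"}]
counts = count_group_by(trades, "desk")
```

The higher-level tools exist so that an analyst can write the one-line query and leave the three functions to the framework.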

Page 26: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

What is Hadoop? Job Flow Control Layer

The Data Access Layer is the layer that accomplishes the following:

Allows for a higher level access which can be translated to a Map/Reduce Job

Pig (Yahoo!)

Hive (Facebook)

Allows for ad hoc access to data outside of the Map/Reduce Framework (HBase)

Allows for processes to be chained together to create a workflow (Oozie)*

*Nowhere else to put it…
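Chaining processes into a workflow amounts to running steps in dependency order. The sketch below shows that idea with Python's standard graphlib module (Python 3.9 or later). It is not Oozie's workflow format or API: the step names, the shared context dictionary, and the run_workflow helper are invented for the example.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, depends_on):
    """Run each step once, only after the steps it depends on have finished.

    steps:      name -> callable that takes the shared context dict
    depends_on: name -> set of step names that must finish first
    """
    context = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        steps[name](context)
    return order, context


# Hypothetical three-step chain: ingest, then clean, then report.
steps = {
    "ingest": lambda ctx: ctx.update(rows=[3, 1, 2]),
    "clean": lambda ctx: ctx.update(rows=sorted(ctx["rows"])),
    "report": lambda ctx: ctx.update(total=sum(ctx["rows"])),
}
depends_on = {"ingest": set(), "clean": {"ingest"}, "report": {"clean"}}
order, context = run_workflow(steps, depends_on)
```

In a real deployment each step would be a Map/Reduce, Pig, or Hive job rather than a lambda, but the ordering problem is the same.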

Page 27: Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel

List of Apache Incubator Projects associated with Hadoop:

Storm

Accumulo

Knox

Sentry

Falcon

DataFu

Drill

Tez

Twill

Phoenix

Hadoop Dev Tools

Tajo