Dubai Big Data in Finance, Intro to Hadoop 2-Apr-14 - Michael Segel


DESCRIPTION

A high-level introduction to Hadoop and its ‘layer cake’. Presented at Big Data in Finance, Dubai, on 2 April 2014.


INTRODUCTION TO THE HADOOP ECOSYSTEM: BAKING A LAYER CAKE AND BEYOND…

“Qu’ils mangent de la brioche.” (‘Let them eat cake.’)

BEFORE WE BEGIN: Questions for the audience…

How many of you have:

Been working with Hadoop for more than 3 months?

Been working with Hadoop for more than 6 months?

Been working with Hadoop for more than 1 year?

How many of you have heard about this thing called ‘Hadoop’ / ‘Big Data’ and thought it would be fun to check it out?

About the Speaker

BSCIS - The College of Engineering, The Ohio State University

‘Big Data’ Consultant with > 25 years in IT

Working solely in the ‘Big Data’ space since 2009

Founded Chicago area Hadoop User Group (CHUG) in April 2010

1600+ Members

Over 200 different companies across all industries in the Chicagoland area.

Has spoken routinely at conferences around the US on Hadoop.

Guest lecturer at the Illinois Institute of Technology.

Co-authored papers found on InfoQ.

MapR Admin and Cloudera Admin & Developer certified.


email: MSegel (at) segel.com

Skype: Michael_Segel

What is Hadoop?

‘A framework of software tools that lets one take a large problem and process its individual pieces in parallel.’
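That definition can be sketched in a few lines of Python. The workload here (summing squares over ranges) is an invented stand-in for a ‘large problem’; it is not what Hadoop itself runs, just the split/process/combine shape the definition describes.

```python
# Split one large computation into independent pieces, process the
# pieces in parallel, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def process_piece(piece):
    # Each piece is handled independently of the others.
    lo, hi = piece
    return sum(n * n for n in range(lo, hi))

# Split the problem into 10 independent ranges.
pieces = [(i, i + 100_000) for i in range(0, 1_000_000, 100_000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_piece, pieces))

# Combine: same answer as the serial computation, computed piecewise.
total = sum(partial_results)
print(total == sum(n * n for n in range(1_000_000)))  # True
```

Hadoop applies the same pattern, but across many machines and with the framework handling distribution, scheduling, and failure.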


Our Hadoop Layer Cake: Circa 2010

[Diagram: layers, bottom to top — Storage, Job Control, Data Access, Programming Languages]

Our Hadoop Layer Cake: Circa 2013 (Hadoop 2.0)

[Diagram: Storage, Job Control, Resource Control, Real Time, Messages, Data Frameworks]

Confused? This is just the tip of the iceberg.

The only constant is change…

Hadoop is a disruptive technology, forcing the enterprise to rethink how it handles data.

The core Apache Framework is just the starting point.

Disruption allows new vendors to compete with established vendors.

If you can build a better mousetrap, you will attract customers.

Hadoop plays nice with others…

Myth: PROPRIETARY SOFTWARE IS BAD.

Reality: VENDOR LOCK-IN IS BAD.

Myth: HADOOP IS ONLY GOOD FOR BATCH PROCESSING.

Reality: HADOOP CAN ALSO BE USED FOR ‘REAL TIME’ PROBLEMS.


REAL TIME HADOOP: SINGLE DATA CENTER SOLUTION

Nightly batch jobs create the next day’s advertising lists.

Client phone connects to the web service.

Web service talks to the Ad Engine.

Phone connects to the Ad Engine to get an ad.

Ad Engine connects to HBase to get the list of potential ads to display, sending the correct ad to the phone.
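The lookup step can be sketched as follows. A plain Python dict stands in for the HBase table, and every name (user IDs, ad IDs, the selection rule) is invented for illustration; the talk does not specify the client’s schema or ranking logic.

```python
# Toy sketch of the ad-serving flow. A dict stands in for the HBase
# table the nightly batch job populates: user -> next day's candidates.
ad_lists = {
    "user-42": ["ad-coffee", "ad-travel", "ad-phone"],
}

def serve_ad(user_id, context):
    """Ad Engine: fetch the candidate list, pick one ad for the phone."""
    candidates = ad_lists.get(user_id, [])
    if not candidates:
        return None
    # Trivial selection rule; the real engine would rank candidates.
    return candidates[hash(context) % len(candidates)]

ad = serve_ad("user-42", "sports-page")
print(ad in ad_lists["user-42"])  # True
```

The point of the architecture is that the expensive work (building the lists) happens in batch overnight, so the per-request path is a single low-latency key lookup.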

Myth: HADOOP IS A STAND-ALONE SYSTEM AND WILL REPLACE TRADITIONAL VENDORS’ PRODUCTS.

Reality: HADOOP IS PART OF THE ENTERPRISE. IT CAN BE STANDALONE, OR IT CAN WORK WITH EXISTING INFRASTRUCTURE.

HADOOP AND THE ENTERPRISE: WE CAN ALL GET ALONG….

Hadoop communicates well with the rest of the Enterprise…

Central cluster feeds distributed web services with local database backing…

HADOOP AND THE ENTERPRISE: WE CAN ALL GET ALONG….

Traditional data stores play nice with Hadoop, some seeing HDFS files as external tables.

How Traditional Vendors View Hadoop

In the beginning, they saw Hadoop as a threat they would crush.

Then: if you can’t beat them, join them….

Oracle Partners with Cloudera

EMC partnered with MapR, then released its own distribution. (Green Stack)

Teradata partners with Hortonworks.

Microsoft partnered with Hortonworks.

Intel tried to create its own distro; last week it dropped the distro and made a large investment in Cloudera.

IBM has its own distro, yet certifies its tools to run on Cloudera.

Cisco partners with MapR

Amazon (AWS) has its own distro, and partners with MapR.

Myth: HADOOP CLUSTERS SHOULD BE BUILT ON COMMODITY HARDWARE.

Reality: YOU CAN DESIGN YOUR CLUSTER AROUND CONSTRAINTS…


ALTERNATIVE CLUSTER LAYOUT: STORAGE / COMPUTE CLUSTER

A higher density of disk and compute.

Premium over commodity hardware.

I/O latency.

Could be part of a virtualization solution.

Myth: HADOOP IS OPEN SOURCE AND THEREFORE FREE.

Reality: T.A.N.S.T.A.A.F.L. (‘TANS-TAH-FELL’: THERE AIN’T NO SUCH THING AS A FREE LUNCH)

There ain’t no such thing as a free lunch…

Customers are paying for support.

Tools are primitive and require work; no real point-and-click solution is in place yet, but it’s getting there.

Hadoop fills the gap where you want a custom solution. Merging semi-structured and structured data is going to be data dependent, requiring customization.

Beyond ETL and SQL, custom apps require developer expertise. (You must invest in skills.)

Depending on Use Case, Time to Value (TtV) will differ.

Bottom line: there is a cost reduction over traditional solutions, but it’s not free.

Take away…

Hadoop is a tool set that is constantly evolving.

Beware of marketing myths…

Do your own homework and talk to the vendors.

Make them earn your business.

T.A.N.S.T.A.A.F.L. applies; you need to make an investment in terms of skills.

Hadoop isn’t a separate solution and should be part of your overall Enterprise strategy.

Hadoop isn’t a silver bullet. By itself, it doesn’t solve your business problems.

YOU CAN HAVE YOUR CAKE AND EAT IT TOO!

QUESTIONS?

Thank You For Your Time

What is a layer cake?

layer cake

noun [C] US

: two or more soft cakes put on top of each other with jam, cream, icing, etc. (= a sweet mixture made from sugar) between the cakes and covering the top and sides. Also: a term for a diagram showing how the components of a system tie together as a functional stack.


What is Hadoop? Storage Layer

The Storage Layer is a Distributed File System that accomplishes the following:

Uniform Access from any machine in the cluster.

Fast Access

Resiliency (Self Healing)

Redundancy (Replication)

This is known as HDFS - the Hadoop Distributed File System.
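As a back-of-envelope illustration of the redundancy point: HDFS splits each file into fixed-size blocks and replicates every block. The numbers below are typical defaults (128 MB blocks, replication factor 3), not figures from the talk.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Blocks needed for one file, and total raw MB consumed.

    The last block only occupies its actual size; HDFS does not pad
    it out to the full block size.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

# A 1000 MB file: 8 blocks, 3000 MB of raw disk across the cluster.
print(hdfs_footprint(1000))  # (8, 3000)
```

Replication is what buys the resiliency and self-healing bullets above: lose a node, and every block it held still exists on two other machines.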

What is Hadoop? Job Control Layer

The Job Control Layer is the layer that accomplishes the following:

Manages and schedules jobs to be run (default FIFO, Capacity Scheduler, Fair Scheduler).

Manages the overall job and distributes the subprocesses across the cluster.

Manages the subprocesses being run on each node in the cluster.

This is accomplished by a Job Tracker (Cluster level) and Task Tracker (Node Level)
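A toy simulation of that split might look like the following: a cluster-level tracker queues tasks FIFO and hands them out to node-level trackers. Class names mirror the slide’s terms, but everything else is ours, and the real daemons do far more (heartbeats, data locality, failure handling).

```python
from collections import deque

class TaskTracker:
    """Node level: runs the tasks it is handed."""
    def __init__(self, node):
        self.node = node
        self.completed = []

    def run(self, task):
        # A real TaskTracker launches the task in a child process.
        self.completed.append(task)

class JobTracker:
    """Cluster level: splits jobs into tasks, schedules them FIFO."""
    def __init__(self, trackers):
        self.trackers = trackers
        self.queue = deque()  # default FIFO scheduling

    def submit(self, job, num_tasks):
        for i in range(num_tasks):
            self.queue.append(f"{job}-task{i}")

    def dispatch(self):
        # Hand queued tasks out round-robin across the nodes.
        i = 0
        while self.queue:
            self.trackers[i % len(self.trackers)].run(self.queue.popleft())
            i += 1

trackers = [TaskTracker(f"node{n}") for n in range(3)]
jt = JobTracker(trackers)
jt.submit("wordcount", 6)
jt.dispatch()
print([len(t.completed) for t in trackers])  # [2, 2, 2]
```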

What is Hadoop? Data Access Layer

The Data Access Layer is the layer that accomplishes the following:

Allows for a higher level access which can be translated to a Map/Reduce Job

Pig (Yahoo!)

Hive (Facebook)

Allows for ad hoc access to data outside of the Map/Reduce framework (HBase)
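What ‘translated to a Map/Reduce job’ means can be sketched in-process: a Hive-style GROUP BY / COUNT compiles down to a map phase that emits key/value pairs and a reduce phase that aggregates per key. This is an illustration of the shape of the translation, not Hive’s actual query plan.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Like the generated mapper: emit (word, 1) for every word.
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # After the shuffle/sort, group by key and sum the values.
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield (word, sum(v for _, v in group))

rows = ["hive compiles sql", "pig compiles scripts"]
result = dict(reduce_phase(map_phase(rows)))
print(result["compiles"])  # 2
```

The value of Pig and Hive is that the user writes the one-line query or script and never sees these phases.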

What is Hadoop? Job Flow Control Layer

The Job Flow Control Layer sits on top of the Data Access Layer and accomplishes the following:

Allows for processes to be chained together to create a work flow (Oozie)*

*Nowhere else to put it…
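The chaining idea behind Oozie can be sketched as a toy pipeline where each step consumes the previous step’s output. The step names and data are invented; real Oozie workflows are XML definitions of Hadoop actions, with branching and error handling this sketch omits.

```python
# Toy workflow: each step's output feeds the next step in the chain.
def ingest(_):
    # Stand-in for pulling raw records into the cluster.
    return ["raw,1", "raw,2"]

def transform(lines):
    # Stand-in for a Map/Reduce or Pig transformation step.
    return [line.upper() for line in lines]

def load(lines):
    # Stand-in for loading results into a serving store.
    return {"loaded": len(lines)}

def run_workflow(steps, data=None):
    for step in steps:
        data = step(data)  # chain the steps in order
    return data

result = run_workflow([ingest, transform, load])
print(result)  # {'loaded': 2}
```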

List of Apache Incubator Projects associated with Hadoop:

Storm

Accumulo

Knox

Sentry

Falcon

DataFu

Drill

Tez

Twill

Phoenix

Hadoop Dev Tools

Tajo
