An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
ELEPHANT AT THE DOOR: MODERN DATA ARCHITECTURE
Adam Muise – Solution Architect, Hortonworks
Who am I?
Who is Hortonworks?
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive Innovation in the platform – We lead the roadmap
100% Open Source – Democratized Access to Data
We do Hadoop successfully.
Support, Professional Services, Training
What is Hadoop? What is everyone talking about?
Data
“Big Data” is the marketing term of the decade in IT.
What lurks behind the hype is the democratization of Data.
You need data.
Data fuels analytics. Analytics fuels business decisions.
So we save the data because we think we need it, but often we really don’t know what to do with it.
We put away data, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
You need value from your data. You need to make decisions from your data.
So what are the problems with Big Data?
Let’s talk challenges…
Volume. Volume. Volume…
[Dozens of slides repeating the single word “Volume”.]
Storage, Management, and Processing all become challenges with Data at Volume.
Traditional technologies adopt a divide, drop, and conquer approach.
The solution?
[Diagram: data crammed into silo after silo – an EDW, Yet Another EDW, an Analytical DB, OLTP, Another EDW.]
Ummm…you dropped something
[Diagram: the same silos – EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW – with data spilling out on the floor between them.]
Analyzing the data usually raises more interesting questions…
…which leads to more data.
Wait, you’ve seen this before.
[Diagram: piles of data accumulating on piles of data.]
Analytics Sausage Factory
[Diagram: data goes in one end and even more data comes out the other.]
Data begets Data.
What keeps us from our Data?
“Prices, Stupid passwords, and Boring Statistics.” – Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
Your data silos are lonely places.
[Diagram: four separate silos – EDW, Accounts, Customers, Web Properties – each holding its own data.]
… Data likes to be together.
[Diagram: the same silos – EDW, Accounts, Customers, Web Properties – with their data drawn together.]
Data likes to socialize too.
[Diagram: the familiar silos – EDW, Accounts, Customers, Web Properties – joined by new sources: Machine Data, Twitter, CDR, Weather Data.]
New types of data don’t quite fit into your pristine view of the world.
My Little Data Empire
[Diagram: the tidy data empire confronted by Logs, Machine Data, and unidentified “?” sources.]
To resolve this, some people take hints from Lord Of The Rings...
…and create One-Schema-To-Rule-Them-All…
[Diagram: one EDW with a single Schema imposed on all the data.]
…but that has its problems too.
[Diagram: data forced through layer after layer of ETL to fit the single EDW schema.]
What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema?
We call it a Data Lake.
[Diagram: a Data Lake landing data from many Data Sources, applying schemas at read time, processing in place, and feeding the EDW and BI & Analytics.]
A Data Lake Architecture enables (see the sketch below):
- Landing data without forcing a single schema
- Landing a variety and large volume of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
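To make “landing without forcing a schema” concrete, here is a minimal Java sketch, assuming a reachable HDFS cluster; the NameNode address, paths, and file name are illustrative assumptions, not details from this deck:

    // Hypothetical example: land a raw log file in a date-partitioned
    // raw zone. No schema is imposed; engines apply one at read time.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LandRawData {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address
        FileSystem fs = FileSystem.get(conf);
        Path raw = new Path("/data/raw/weblogs/2014-02-24/");
        fs.mkdirs(raw);
        fs.copyFromLocalFile(new Path("/tmp/access.log"), raw);
        fs.close();
      }
    }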
In most cases, more data is better. Work with the population, not just a sample.
Your view of a client today:
- Male / Female
- Age: 25-30
- Town/City
- Middle Income Band
- Product Category Preferences
Your view with more data:
- Male / Female
- Age: 27 but feels old
- GPS coordinates
- $65-68k per year
- Product recommendations
- Tea Party Hippie
- Looking to start a business
- Walking into Starbucks right now…
- A depressed Toronto Maple Leafs fan
- Products left in basket indicate drunk Amazon shopper
- Gene Expression for Risk Taker
- Thinking about a new house
- Unhappy with his cell phone plan
- Pregnant
- Spent 25 minutes looking at tea cozies
Pick up all of that data that was prohibitively expensive to store and use.
Why do viewer surveys…
…when raw data can tell you what button on the remote was pressed during what commercial for the entire viewer population?
Why make separate risk assessments in separate data silos…
…when you can do a risk assessment on the entire data footprint of the client?
To approach these use cases you need an affordable platform that stores, processes, and analyzes the data.
So what is the answer?
Enter the Hadoop.
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
Hadoop was created because traditional technologies never cut it for Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn.
Traditional architecture didn’t scale enough…
[Diagram: app servers piled on databases piled on SANs, multiplied again and again.]
Databases can become bloated and useless.
Traditional architectures cost too much at that volume…
$/TB, $pecial Hardware, $upercomputing
So what is the answer?
If you could design a system that would handle this, what would it look like?
It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Diagram: a grid of Storage nodes.]
It would probably need a completely parallel processing framework that took tasks to the data…
[Diagram: Processing stacked on top of every Storage node.]
It would probably run on commodity hardware, virtualized machines, and common OS platforms.
[Diagram: the same Storage + Processing grid.]
It would probably be open source so innovation could happen as quickly as possible.
It would need a critical mass of users.
{Processing + Storage} = {MapReduce/Tez/YARN + HDFS}
HDFS stores data in blocks and replicates those blocks.
[Diagram: block1, block2, and block3, each replicated across three different nodes of the Storage + Processing grid.]
If a block fails then HDFS always has the other copies and heals itself.
[Diagram: one block replica crossed out with an X; the surviving copies are used to re-replicate.]
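A minimal Java sketch of what replication looks like from the client side, assuming the file landed earlier exists; the replication factor shown is HDFS’s common default of three:

    // List where each block's replicas live; if a DataNode dies, the
    // NameNode re-replicates from the surviving copies automatically.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/raw/weblogs/2014-02-24/access.log");
        fs.setReplication(file, (short) 3); // three copies of every block
        FileStatus stat = fs.getFileStatus(file);
        for (BlockLocation loc :
             fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println(loc); // offset, length, and replica hosts
        }
      }
    }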
MapReduce is a programming paradigm that is completely parallel.
[Diagram: five Mappers reading blocks of Data in parallel and feeding three Reducers.]
MapReduce has three phases: Map, Sort/Shuffle, Reduce
[Diagram: Mappers emitting Key, Value pairs, which the framework sorts and shuffles by key to the Reducers.]
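The canonical worked example of the three phases is WordCount; below is a standard sketch (driver/job setup omitted): map emits (word, 1) pairs, the framework sorts and shuffles them by key, and reduce sums per word:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            word.set(token);
            ctx.write(word, ONE); // Map phase: emit (word, 1)
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum)); // Reduce: total per word
        }
      }
    }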
MapReduce applies to a lot of data processing problems
[Diagram: the same Mapper and Reducer pipeline applied across the data.]
MapReduce goes a long way, but not all data processing and analytics are solved the same way.
Sometimes your data application needs parallel processing and inter-process communication…
[Diagram: parallel Processes working over the data while communicating with each other.]
…like Complex Event Processing in Apache Storm
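As a sketch of how such a topology is wired with Storm’s 0.9-era Java API: EventSpout and AlertBolt below are hypothetical user classes standing in for a real event source and a real pattern-matching bolt, so this won’t compile until you supply them:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class CepTopology {
      public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 2);  // ingest stream
        builder.setBolt("alerts", new AlertBolt(), 4)     // detect patterns
               .shuffleGrouping("events");                // subscribe to spout
        StormSubmitter.submitTopology("cep", new Config(),
                                      builder.createTopology());
      }
    }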
Sometimes your machine learning data application needs to process in memory and iterate…
[Diagram: Processes holding the data in memory and looping over it repeatedly.]
…like Machine Learning in Spark
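A minimal sketch of that pattern in Spark’s Java API; the input path is an assumption and the update step is a stand-in for a real algorithm:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterateInMemory {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("iterate"));
        // cache() keeps the data in cluster memory, so every pass of the
        // loop hits RAM instead of disk -- the property iteration needs.
        JavaRDD<String> points = sc.textFile("hdfs:///data/points").cache();
        double model = 0.0;
        for (int i = 0; i < 10; i++) {
          model += points.count(); // stand-in for a gradient/update step
        }
        System.out.println("model = " + model);
        sc.stop();
      }
    }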
Introducing Tez
Tez is a YARN application, like MapReduce is a YARN application.
Tez is the Lego set for your data application.
Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs, suitable for things like Hive SQL projections, group by, and order by.
[Diagram: TezMap tasks feeding chained TezReduce tasks inside a single job.]
Tez can provide long-running containers for applications like Hive, to side-step the batch-process startup costs you would have with MapReduce.
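One way to see this from the client side is a sketch over Hive’s JDBC driver; the host, port, and table are assumptions, while hive.execution.engine=tez is the real switch that sends the query to Tez instead of MapReduce:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveOnTez {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement()) {
          stmt.execute("SET hive.execution.engine=tez");
          ResultSet rs = stmt.executeQuery(
              "SELECT page, COUNT(*) c FROM weblogs " +
              "GROUP BY page ORDER BY c DESC");
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }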
Introducing YARN
YARN: Yeah, we did that too.
hortonworks.com/yarn/
YARN = Yet Another Resource Negotiator
Resource Manager + Node Managers = YARN
[Diagram: the Resource Manager with its Scheduler granting containers; AppMasters on Node Managers run MapReduce, Storm, and Pig tasks in those containers.]
YARN abstracts resource management so you can run more than just MapReduce.
[Diagram: HDFS2 at the bottom, YARN above it, and on top MapReduce V2, Storm, MPI, Giraph, HBase, Tez, Spark … and more.]
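A small sketch against the YARN client API makes the point: one ResourceManager, many application types. Only a cluster configuration on the classpath is assumed:

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListApps {
      public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();
        for (ApplicationReport app : yarn.getApplications()) {
          // MAPREDUCE, TEZ, and friends show up side by side here.
          System.out.println(app.getName() + "\t" + app.getApplicationType());
        }
        yarn.stop();
      }
    }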
Hadoop has other open source projects…
Hive = {SQL -> Tez || MapReduce} SQL-IN-HADOOP
Pig = {Pig Latin -> Tez || MapReduce}
HCatalog = {metadata* for MapReduce, Hive, Pig, HBase}
*metadata = tables, columns, partitions, types
Oozie = Job::{Task, Task, if Task, then Task, final Task}
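As a sketch of the Pig side, here is the embedded PigServer API driving a short Pig Latin script; the paths and field names are illustrative:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class RunPig {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery(
            "logs = LOAD '/data/raw/weblogs' AS (ip:chararray, url:chararray);");
        pig.registerQuery("byUrl = GROUP logs BY url;");
        pig.registerQuery("counts = FOREACH byUrl GENERATE group, COUNT(logs);");
        pig.store("counts", "/data/out/url_counts"); // triggers the job
        pig.shutdown();
      }
    }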
Falcon
[Diagram: Falcon replicating Data Sets between Hadoop clusters, and handling late data arrival, lineage, retention policy, process management, audit, monitoring, and archival.]
Knox
[Diagram: REST Clients reach the Hadoop Cluster only through the Knox Gateway, which authenticates against Enterprise LDAP.]
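A hedged sketch of what a perimeter call can look like: a plain HTTPS client hitting WebHDFS through the gateway. The gateway host, the “default” topology path, and the credentials are all assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class KnoxList {
      public static void main(String[] args) throws Exception {
        URL url = new URL("https://knox-host:8443/gateway/default"
                          + "/webhdfs/v1/data?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Knox authenticates the caller (e.g. against LDAP) and proxies
        // the request; the cluster's internal hosts stay hidden.
        String auth = Base64.getEncoder()
            .encodeToString("user:password".getBytes());
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(conn.getInputStream()))) {
          String line;
          while ((line = in.readLine()) != null) System.out.println(line);
        }
      }
    }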
Flume
[Diagram: Flume agents funneling JMS messages, weblogs, events, and files into Hadoop.]
Sqoop
[Diagram: Sqoop moving data between databases and Hadoop in both directions.]
Ambari = {install, manage, monitor}
HBase = {real-time, distributed-map, big-tables}
Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
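To ground the HBase line, a sketch of a real-time write and read with the 0.96-era client API; the table and column names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseHello {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "customers");
        Put put = new Put(Bytes.toBytes("row-42"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("city"),
                Bytes.toBytes("Saint John")); // visible to readers at once
        table.put(put);
        Result r = table.get(new Get(Bytes.toBytes("row-42")));
        System.out.println(Bytes.toString(
            r.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"))));
        table.close();
      }
    }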
Apache Hadoop
[Diagram: the open source component stack – HDFS, YARN, MapReduce, Tez, Hive, Pig, HCatalog, HBase, Storm, Falcon, Flume, Sqoop, Knox, Ambari.]
Hortonworks Data Platform
[Diagram: the same components, packaged and supported as HDP.]
What else are we working on?
hortonworks.com/labs/
Hadoop is the new Modern Data Architecture for the Enterprise
There is NO second place
Hortonworks …the Bull Elephant of Hadoop Innovation