An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
ELEPHANT AT THE DOOR: MODERN DATA ARCHITECTURE
Adam Muise – Solution Architect, Hortonworks
Who am I?
Who is Hortonworks?
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive Innovation in the platform – We lead the roadmap
100% Open Source – Democratized Access to Data
We do Hadoop successfully.
Support, Professional Services, Training
What is Hadoop? What is everyone talking about?
Data
“Big Data” is the marketing term of the decade in IT.
What lurks behind the hype is the democratization of Data.
You need data.
Data fuels analytics. Analytics fuels business decisions.
So we save the data because we think we need it, but often we really don’t know what to do with it.
We put away data, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
You need value from your data. You need to make decisions from your data.
So what are the problems with Big Data?
Let’s talk challenges…
Volume. Volume. Volume…
[Dozens of slides repeating the single word “Volume”.]
Storage, Management, and Processing all become challenges with Data at Volume.
Traditional technologies adopt a divide, drop, and conquer approach.
The solution?
[Diagram: data crammed into silo after silo – an EDW, Yet Another EDW, an Analytical DB, OLTP, Another EDW.]
Ummm…you dropped something
[Diagram: the same silos – EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW – with data spilling out on the floor between them.]
Analyzing the data usually raises more interesting questions…
…which leads to more data.
Wait, you’ve seen this before.
[Diagram: piles of data accumulating on piles of data.]
Analytics Sausage Factory
[Diagram: data goes in one end and even more data comes out the other.]
Data begets Data.
What keeps us from our Data?
“Prices, Stupid passwords, and Boring Statistics.” – Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w
Your data silos are lonely places.
[Diagram: four separate silos – EDW, Accounts, Customers, Web Properties – each holding its own data.]
… Data likes to be together.
[Diagram: the same silos – EDW, Accounts, Customers, Web Properties – with their data drawn together.]
Data likes to socialize too.
[Diagram: the familiar silos – EDW, Accounts, Customers, Web Properties – joined by new sources: Machine Data, Twitter, CDR, Weather Data.]
New types of data don’t quite fit into your pristine view of the world.
My Little Data Empire
[Diagram: the tidy data empire confronted by Logs, Machine Data, and unidentified “?” sources.]
To resolve this, some people take hints from Lord Of The Rings...
…and create One-Schema-To-Rule-Them-All…
[Diagram: one EDW with a single Schema imposed on all the data.]
…but that has its problems too.
[Diagram: data forced through layer after layer of ETL to fit the single EDW schema.]
What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema?
We call it a Data Lake.
[Diagram: a Data Lake landing data from many Data Sources, applying schemas at read time, processing in place, and feeding the EDW and BI & Analytics.]
A Data Lake Architecture enables (see the sketch below):
- Landing data without forcing a single schema
- Landing a variety and large volume of data efficiently
- Retaining data for a long period of time with a very low $/TB
- A platform to feed other Analytical DBs
- A platform to execute next gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
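To make “landing without forcing a schema” concrete, here is a minimal Java sketch, assuming a reachable HDFS cluster; the NameNode address, paths, and file name are illustrative assumptions, not details from this deck:

    // Hypothetical example: land a raw log file in a date-partitioned
    // raw zone. No schema is imposed; engines apply one at read time.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LandRawData {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address
        FileSystem fs = FileSystem.get(conf);
        Path raw = new Path("/data/raw/weblogs/2014-02-24/");
        fs.mkdirs(raw);
        fs.copyFromLocalFile(new Path("/tmp/access.log"), raw);
        fs.close();
      }
    }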
In most cases, more data is better. Work with the population, not just a sample.
Your view of a client today:
- Male / Female
- Age: 25-30
- Town/City
- Middle Income Band
- Product Category Preferences
Your view with more data:
- Male / Female
- Age: 27 but feels old
- GPS coordinates
- $65-68k per year
- Product recommendations
- Tea Party Hippie
- Looking to start a business
- Walking into Starbucks right now…
- A depressed Toronto Maple Leafs fan
- Products left in basket indicate drunk Amazon shopper
- Gene Expression for Risk Taker
- Thinking about a new house
- Unhappy with his cell phone plan
- Pregnant
- Spent 25 minutes looking at tea cozies
Pick up all of that data that was prohibitively expensive to store and use.
Why do viewer surveys…
…when raw data can tell you what button on the remote was pressed during what commercial for the entire viewer population?
Why make separate risk assessments in separate data silos…
…when you can do a risk assessment on the entire data footprint of the client?
To approach these use cases you need an affordable platform that stores, processes, and analyzes the data.
So what is the answer?
Enter the Hadoop.
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
Hadoop was created because traditional technologies never cut it for Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn.
Traditional architecture didn’t scale enough…
[Diagram: app servers piled on databases piled on SANs, multiplied again and again.]
Databases can become bloated and useless.
Traditional architectures cost too much at that volume…
$/TB, $pecial Hardware, $upercomputing
So what is the answer?
If you could design a system that would handle this, what would it look like?
It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Diagram: a grid of Storage nodes.]
It would probably need a completely parallel processing framework that took tasks to the data…
[Diagram: Processing stacked on top of every Storage node.]
It would probably run on commodity hardware, virtualized machines, and common OS platforms.
[Diagram: the same Storage + Processing grid.]
It would probably be open source so innovation could happen as quickly as possible.
It would need a critical mass of users.
{Processing + Storage} = {MapReduce/Tez/YARN + HDFS}
HDFS stores data in blocks and replicates those blocks.
[Diagram: block1, block2, and block3, each replicated across three different nodes of the Storage + Processing grid.]
If a block fails then HDFS always has the other copies and heals itself.
[Diagram: one block replica crossed out with an X; the surviving copies are used to re-replicate.]
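A minimal Java sketch of what replication looks like from the client side, assuming the file landed earlier exists; the replication factor shown is HDFS’s common default of three:

    // List where each block's replicas live; if a DataNode dies, the
    // NameNode re-replicates from the surviving copies automatically.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/raw/weblogs/2014-02-24/access.log");
        fs.setReplication(file, (short) 3); // three copies of every block
        FileStatus stat = fs.getFileStatus(file);
        for (BlockLocation loc :
             fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println(loc); // offset, length, and replica hosts
        }
      }
    }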
MapReduce is a programming paradigm that is completely parallel.
[Diagram: five Mappers reading blocks of Data in parallel and feeding three Reducers.]
MapReduce has three phases: Map, Sort/Shuffle, Reduce
[Diagram: Mappers emitting Key, Value pairs, which the framework sorts and shuffles by key to the Reducers.]
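The canonical worked example of the three phases is WordCount; below is a standard sketch (driver/job setup omitted): map emits (word, 1) pairs, the framework sorts and shuffles them by key, and reduce sums per word:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            word.set(token);
            ctx.write(word, ONE); // Map phase: emit (word, 1)
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum)); // Reduce: total per word
        }
      }
    }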
MapReduce applies to a lot of data processing problems
[Diagram: the same Mapper and Reducer pipeline applied across the data.]
MapReduce goes a long way, but not all data processing and analytics are solved the same way.
Sometimes your data application needs parallel processing and inter-process communication…
[Diagram: parallel Processes working over the data while communicating with each other.]
…like Complex Event Processing in Apache Storm
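As a sketch of how such a topology is wired with Storm’s 0.9-era Java API: EventSpout and AlertBolt below are hypothetical user classes standing in for a real event source and a real pattern-matching bolt, so this won’t compile until you supply them:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class CepTopology {
      public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 2);  // ingest stream
        builder.setBolt("alerts", new AlertBolt(), 4)     // detect patterns
               .shuffleGrouping("events");                // subscribe to spout
        StormSubmitter.submitTopology("cep", new Config(),
                                      builder.createTopology());
      }
    }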
Sometimes your machine learning data application needs to process in memory and iterate…
[Diagram: Processes holding the data in memory and looping over it repeatedly.]
…like Machine Learning in Spark
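A minimal sketch of that pattern in Spark’s Java API; the input path is an assumption and the update step is a stand-in for a real algorithm:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterateInMemory {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("iterate"));
        // cache() keeps the data in cluster memory, so every pass of the
        // loop hits RAM instead of disk -- the property iteration needs.
        JavaRDD<String> points = sc.textFile("hdfs:///data/points").cache();
        double model = 0.0;
        for (int i = 0; i < 10; i++) {
          model += points.count(); // stand-in for a gradient/update step
        }
        System.out.println("model = " + model);
        sc.stop();
      }
    }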
Introducing Tez
Tez is a YARN application, like MapReduce is a YARN application.
Tez is the Lego set for your data application.
Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs, suitable for things like Hive SQL projections, group by, and order by.
[Diagram: TezMap tasks feeding chained TezReduce tasks inside a single job.]
Tez can provide long-running containers for applications like Hive, to side-step the batch-process startup costs you would have with MapReduce.
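One way to see this from the client side is a sketch over Hive’s JDBC driver; the host, port, and table are assumptions, while hive.execution.engine=tez is the real switch that sends the query to Tez instead of MapReduce:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveOnTez {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement()) {
          stmt.execute("SET hive.execution.engine=tez");
          ResultSet rs = stmt.executeQuery(
              "SELECT page, COUNT(*) c FROM weblogs " +
              "GROUP BY page ORDER BY c DESC");
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }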
Introducing YARN
YARN: Yeah, we did that too.
hortonworks.com/yarn/
YARN = Yet Another Resource Negotiator
Resource Manager + Node Managers = YARN
[Diagram: the Resource Manager with its Scheduler granting containers; AppMasters on Node Managers run MapReduce, Storm, and Pig tasks in those containers.]
YARN abstracts resource management so you can run more than just MapReduce.
[Diagram: HDFS2 at the bottom, YARN above it, and on top MapReduce V2, Storm, MPI, Giraph, HBase, Tez, Spark … and more.]
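A small sketch against the YARN client API makes the point: one ResourceManager, many application types. Only a cluster configuration on the classpath is assumed:

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListApps {
      public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();
        for (ApplicationReport app : yarn.getApplications()) {
          // MAPREDUCE, TEZ, and friends show up side by side here.
          System.out.println(app.getName() + "\t" + app.getApplicationType());
        }
        yarn.stop();
      }
    }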
Hadoop has other open source projects…
Hive = {SQL -> Tez || MapReduce} SQL-IN-HADOOP
Pig = {Pig Latin -> Tez || MapReduce}
HCatalog = {metadata* for MapReduce, Hive, Pig, HBase}
*metadata = tables, columns, partitions, types
Oozie = Job::{Task, Task, if Task, then Task, final Task}
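As a sketch of the Pig side, here is the embedded PigServer API driving a short Pig Latin script; the paths and field names are illustrative:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class RunPig {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery(
            "logs = LOAD '/data/raw/weblogs' AS (ip:chararray, url:chararray);");
        pig.registerQuery("byUrl = GROUP logs BY url;");
        pig.registerQuery("counts = FOREACH byUrl GENERATE group, COUNT(logs);");
        pig.store("counts", "/data/out/url_counts"); // triggers the job
        pig.shutdown();
      }
    }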
Falcon
[Diagram: Falcon replicating Data Sets between Hadoop clusters, and handling late data arrival, lineage, retention policy, process management, audit, monitoring, and archival.]
Knox
[Diagram: REST Clients reach the Hadoop Cluster only through the Knox Gateway, which authenticates against Enterprise LDAP.]
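A hedged sketch of what a perimeter call can look like: a plain HTTPS client hitting WebHDFS through the gateway. The gateway host, the “default” topology path, and the credentials are all assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class KnoxList {
      public static void main(String[] args) throws Exception {
        URL url = new URL("https://knox-host:8443/gateway/default"
                          + "/webhdfs/v1/data?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Knox authenticates the caller (e.g. against LDAP) and proxies
        // the request; the cluster's internal hosts stay hidden.
        String auth = Base64.getEncoder()
            .encodeToString("user:password".getBytes());
        conn.setRequestProperty("Authorization", "Basic " + auth);
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(conn.getInputStream()))) {
          String line;
          while ((line = in.readLine()) != null) System.out.println(line);
        }
      }
    }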
Flume
[Diagram: Flume agents funneling JMS messages, weblogs, events, and files into Hadoop.]
Sqoop
[Diagram: Sqoop moving data between databases and Hadoop in both directions.]
Ambari = {install, manage, monitor}
HBase = {real-time, distributed-map, big-tables}
Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
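To ground the HBase line, a sketch of a real-time write and read with the 0.96-era client API; the table and column names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseHello {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "customers");
        Put put = new Put(Bytes.toBytes("row-42"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("city"),
                Bytes.toBytes("Saint John")); // visible to readers at once
        table.put(put);
        Result r = table.get(new Get(Bytes.toBytes("row-42")));
        System.out.println(Bytes.toString(
            r.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"))));
        table.close();
      }
    }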
Apache Hadoop
[Diagram: the open source component stack – HDFS, YARN, MapReduce, Tez, Hive, Pig, HCatalog, HBase, Storm, Falcon, Flume, Sqoop, Knox, Ambari.]
Hortonworks Data Platform
[Diagram: the same components, packaged and supported as HDP.]
What else are we working on?
hortonworks.com/labs/
Hadoop is the new Modern Data Architecture for the Enterprise
There is NO second place
Hortonworks …the Bull Elephant of Hadoop Innovation