100
ELEPHANT AT THE DOOR: MODERN DATA ARCHITECTURE Adam Muise – Solu/on Architect, Hortonworks

2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Embed Size (px)

DESCRIPTION

An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.

Citation preview

Page 1: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

ELEPHANT  AT  THE  DOOR:  MODERN  DATA  ARCHITECTURE  

Adam  Muise  –  Solu/on  Architect,  Hortonworks  

Page 2: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Who  am  I?  

Page 3: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Who  is                                        ?  

Page 4: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

We  do  Hadoop  

The  leaders  of  Hadoop’s  development  

Community  driven,    Enterprise  Focused  

Drive  Innova/on  in  the  plaForm  –  We  lead  the  roadmap    

100%  Open  Source  –  Democra/zed  Access  to  Data  

Page 5: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

We  do  Hadoop  successfully.  

Support    

Professional  Services  Training  

Page 6: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

What  is  Hadoop?    What  is  everyone  talking  about?  

Page 7: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Data  

Page 8: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

“Big  Data”  is  the  marke/ng  term  of  the  decade  in  IT  

Page 9: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

What  lurks  behind  the  hype  is  the  democra/za/on  of  Data.  

Page 10: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

You  need  data.    

Page 11: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Data  fuels  analy/cs.  Analy/cs  fuels  business  decisions.  

Page 12: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

So  we  save  the  data  because  we  think  we  need  it,  but  oTen  we  really  don’t  know  what  to  do  

with  it.  

Page 13: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

We  put  away  data,  delete  it,  tweet  it,  compress  it,  shred  it,  wikileak-­‐it,  put  it  in  a  database,  put  it  in  SAN/NAS,  put  it  in  the  cloud,  hide  it  in  

tape…  

Page 14: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

You  need  value  from  your  data.  You  need  to  make  decisions  from  your  

data.  

Page 15: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

So  what  are  the  problems  with  Big  Data?  

Page 16: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Let’s  talk  challenges…  

Page 17: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Volume  

Volume  

Volume  

Volume  

Page 18: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  

Volume  Volume   Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Page 19: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  

Volume  Volume   Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume   Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Page 20: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  

Volume  Volume   Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Volume  

Volume  Volume  

Volume  Volume  

Volume  

Volume  

Page 21: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Storage,  Management,  Processing  all  become  challenges  with  Data  at  

Volume  

Page 22: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Tradi/onal  technologies  adopt  a  divide,  drop,  and  conquer  approach  

Page 23: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

The  solu/on?  EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Yet  Another  EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Analy/cal  DB  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data   OLTP  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Another  EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Page 24: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Ummm…you  dropped  something  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  Data  Data  Data  

Data  Data  Data  

Data   Data  Data  Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  Data  Data  

Data  

Data  Data  Data  

Data   Data  Data  

EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Yet  Another  EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Analy/cal  DB  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

OLTP  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Another  EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Page 25: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Analyzing  the  data  usually  raises  more  interes/ng  ques/ons…  

Page 26: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

…which  leads  to  more  data  

Page 27: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Wait,  you’ve  seen  this  before.  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  Data  Data  

Data  

Data  Data  Data  

Data   Data  Data  Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Analy/cs  Sausage  Factory  

Data   Data  Data  

Data  Data  Data  

Data   Data  Data   …  Data  

Data  Data  …  

Data  Data  

Data  

Data  

Page 28: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Data  begets  Data.  

Page 29: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

What  keeps  us  from  our  Data?  

Page 30: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

“Prices,  Stupid  passwords,  and  Boring  Sta/s/cs.”    -­‐  Hans  Rosling  

h)p://www.youtube.com/watch?v=hVimVzgtD6w  

Page 31: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Your  data  silos  are  lonely  places.  

EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Accounts  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Customers  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Web  Proper/es  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Page 32: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

…  Data  likes  to  be  together.  

EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Accounts  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Customers  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Web  Proper/es  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Page 33: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Data  likes  to  socialize  too.  EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Accounts  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Customers  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Web  Proper/es  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Machine  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Twi^er  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Facebook  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

CDR  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Weather  Data  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  

Page 34: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

New  types  of  data  don’t  quite  fit  into  your  pris/ne  view  of  the  world.  

My  Li^le  Data  Empire  

Data  Data  Data  

Data  

Data  Data  

Data   Data  Data  

Logs  

Data  Data  Data  Data  

Data  

Data  Data  

Machine  Data  

Data  Data  Data  Data  

Data  

Data  Data  

?  ?  

?  ?  

Page 35: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

To  resolve  this,  some  people  take  hints  from  Lord  Of  The  Rings...  

Page 36: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

…and  create  One-­‐Schema-­‐To-­‐Rule-­‐Them-­‐All…  

EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  Schema  

Page 37: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

…but  that  has  its  problems  too.  

EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  Schema  Data  

Data  Data  

ETL   ETL  

ETL   ETL  

EDW  

Data  Data  Data  

Data  Data  Data  

Data   Data  Data  Schema  Data  

Data  Data  

ETL   ETL  

ETL   ETL  

Page 38: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

What  if  the  data  was  processed  and  stored  centrally?  What  if  you  didn’t  

need  to  force  it  into  a  single  schema?    

We  call  it  a  Data  Lake.  EDW  

Data  Data  Data  

Data  Data  

Data  Data  

Schema  

BI  &  Analy/cs   Schema   Schema  

Data  Data  Data  

Data  Lake  

Data  Data  Data  

Data  Data  Data  Data  

Data  Data  

Data  Data  Data  

Schema  Schema  

Data  Data  Data  

Process   Process  

Data  Data  Data  

Data  Data  Data  

Data  Data  Data  

Data  Data  Data  Data  Sources  

Data  Sources  

Page 39: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

A  Data  Lake  Architecture  enables:  -­‐  Landing  data  without  forcing  a  single  schema  -­‐  Landing  a  variety  and  large  volume  of  data  

 efficiently  -­‐  Retaining  data  for  a  long  period  of  /me  with  a  very  

 low  $/TB  -­‐  A  plaForm  to  feed  other  Analy/cal  DBs  -­‐  A  plaForm  to  execute  next  gen  data  analy/cs  and  

 processing  applica/ons  (SAS,  Informa/ca,    Graph  Analy/cs,  Machine  Learning,  SAP,    etc…)  

Page 40: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

In  most  cases,  more  data  is  be^er.  Work  with  the  popula/on,  not  just  a  

sample.  

Page 41: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Your  view  of  a  client  today.  

Male  

Female  

Age:  25-­‐30  

Town/City  

Middle  Income  Band  

Product  Category  Preferences  

Page 42: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Your  view  with  more  data.  

Male  

Female  

Age:  27  but  feels  old  

GPS  coordinates  

$65-­‐68k  per  year  

Product  recommenda/ons  

Tea  Party  Hippie  

Looking  to  start  a  business    

Walking  into  Starbucks  right  now…  

A  depressed  Toronto  Maple  Leaf’s  Fan  

Products  leT  in  basket  indicate  drunk  amazon  shopper  

Gene  Expression  for  Risk  Taker  

Thinking  about  a  new  house  

Unhappy  with  his  cell  phone  plan  

Pregnant  

Spent  25  minutes  looking  at  tea  cozies  

Page 43: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Pick  up  all  of  that  data  that  was  prohibi/vely  expensive  to  store  and  

use.      

Page 44: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Why  do  viewer  surveys…  

Page 45: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

…when  raw  data  can  tell  you  what  bu^on  on  the  remote  was  pressed  during  what  commercial  for  the  

en/re  viewer  popula/on?  

Page 46: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Why  make  separate  risk  assessments  in  separate  data  silos…  

Page 47: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

…when  you  can  do  a  risk  assessment  on  the  en/re  data  

footprint  of  the  client?  

Page 48: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

To  approach  these  use  cases  you  need  an  affordable  plaForm  that  stores,  processes,  and  analyzes  the  

data.    

Page 49: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

So  what  is  the  answer?  

Page 50: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Enter  the  Hadoop.  

h^p://www.fabulouslybroke.com/2011/05/ninja-­‐elephants-­‐and-­‐other-­‐awesome-­‐stories/  

………  

Page 51: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Hadoop  was  created  because  tradi/onal  technologies  never  cut  it  

for  the  Internet  proper/es  like  Google,  Yahoo,  Facebook,  Twi^er,  

and  LinkedIn  

Page 52: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Tradi/onal  architecture  didn’t  scale  enough…  

DB   DB  DB  

SAN  

App  App   App  App  

DB   DB  DB  

SAN  

App  App   App  App   DB   DB  DB  

SAN  

App  App   App  App  

Page 53: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Databases  can  become  bloated  and  useless  

Page 54: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Tradi/onal  architectures  cost  too  much  at  that  volume…  

$/TB  

$pecial  Hardware  

$upercompu/ng  

Page 55: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

So  what  is  the  answer?  

Page 56: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

If  you  could  design  a  system  that  would  handle  this,  what  would  it  

look  like?  

Page 57: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

It  would  probably  need  a  highly  resilient,  self-­‐healing,  cost-­‐efficient,  

distributed  file  system…  

Storage   Storage   Storage  

Storage   Storage   Storage  

Storage   Storage   Storage  

Page 58: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

It  would  probably  need  a  completely  parallel  processing  framework  that  

took  tasks  to  the  data…  

Storage   Storage   Storage  

Storage   Storage   Storage  

Storage   Storage   Storage  Processing   Processing  Processing  

Processing   Processing  Processing  

Processing   Processing  Processing  

Page 59: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

It  would  probably  run  on  commodity  hardware,  virtualized  machines,  and  

common  OS  plaForms  

Storage   Storage   Storage  

Storage   Storage   Storage  

Storage   Storage   Storage  Processing   Processing  Processing  

Processing   Processing  Processing  

Processing   Processing  Processing  

Page 60: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

It  would  probably  be  open  source  so  innova/on  could  happen  as  quickly  

as  possible  

Page 61: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

It  would  need  a  cri/cal  mass  of  users  

Page 62: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

{Processing  +  Storage}  =  

{MapReduce/Tez/YARN+  HDFS}  

Page 63: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

HDFS  stores  data  in  blocks  and  replicates  those  blocks  

Storage   Storage   Storage  

Storage   Storage   Storage  

Storage   Storage   Storage  Processing   Processing  Processing  

Processing   Processing  Processing  

Processing   Processing  Processing  block3   block3  

block3  

block2   block2  

block2  

block1  

block1  

block1  

Page 64: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

If  a  block  fails  then  HDFS  always  has  the  other  copies  and  heals  itself  

Storage   Storage   Storage  

Storage   Storage   Storage  

Storage   Storage   Storage  Processing   Processing  Processing  

Processing   Processing  Processing  

Processing   Processing  Processing  block3  

block3  

block3  

block2   block2  

block2  

block1  

block1  

block1  

X

Page 65: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

MapReduce  is  a  programming  paradigm  that  completely  parallel  

Mapper  

Mapper  

Mapper  

Mapper  

Mapper  

Reducer  

Reducer  

Reducer  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Page 66: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

MapReduce  has  three  phases:  Map,  Sort/Shuffle,  Reduce  

Mapper  

Mapper  

Mapper  

Mapper  

Mapper  

Reducer  

Reducer  

Reducer  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Key,  Value  Key,  Value  

Key,  Value  

Page 67: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

MapReduce  applies  to  a  lot  of  data  processing  problems  

Mapper  

Mapper  

Mapper  

Mapper  

Mapper  

Reducer  

Reducer  

Reducer  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Page 68: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

MapReduce  goes  a  long  way,  but  not  all  data  processing  and  analy/cs  

are  solved  the  same  way  

Page 69: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Some/mes  your  data  applica/on  needs  parallel  processing  and  inter-­‐

process  communica/on  

Process  

Process  

Process  

Process  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Page 70: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

…like  Complex  Event  Processing  in  Apache  Storm  

Page 71: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Some/mes  your  machine  learning  data  applica/on  needs  to  process  in  

memory  and  iterate    

Process  

Process  

Process  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  Process   Process  

Process  

Process  

Data  

Data  Data  

Page 72: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

…like  in  Machine  Learning  in  Spark  

Page 73: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Introducing  Tez  

Page 74: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Tez  is  a  YARN  applica/on,  like  MapReduce  is  a  YARN  applica/on  

Page 75: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Tez  is  the  Lego  set  for  your  data  applica/on  

Page 76: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Tez  provides  a  layer  for  abstract  tasks,  these  could  be  mappers,  reducers,  customized  stream  

processes,  in  memory  structures,  etc  

Page 77: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Tez  can  chain  tasks  together  into  one  job  to  get  Map  –  Reduce  –  Reduce  jobs  

suitable  for  things  like  Hive  SQL  projec/ons,  group  by,  and  order  by  

TezMap  

TezMap  

TezMap  

TezMap  

TezMap  

TezReduce  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

Data  

Data  Data  

TezReduce  

TezReduce  

TezReduce  

TezReduce  

TezReduce  

Page 78: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Tez  can  provide  long-­‐running  containers  for  applica/ons  like  Hive  to  side-­‐step  batch  process  startups  you  would  have  with  MapReduce  

Page 79: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Introducing  YARN  

Page 80: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

YARN:    Yeah,  we  did  that  too.  

hortonworks.com/yarn/  

Page 81: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

YARN  =  Yet  Another  Resource  Nego/ator  

Page 82: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Resource  Manager  +  

Node  Managers  =  YARN  

Resource  Manager  

AppMaster  

Node  Manager  

Scheduler  

AppMaster  

AppMaster  

Node  Manager  

Node  Manager  

Node  Manager  

Container  

Container  

MapReduce  

Container  

Storm  

Container  

Container  

Container  

Pig  

Container  

Container  

Container  

Page 83: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

YARN  abstracts  resource  management  so  you  can  run  more  

than  just  MapReduce  

HDFS2  

MapReduce  V2  

YARN  MapReduce  V?   STORM  

MPI  Giraph   HBase  Tez   …  and  more  Spark  

Page 84: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Hadoop  has  other  open  source  projects…  

Page 85: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Hive  =  {SQL  -­‐>  Tez  ||  MapReduce}  SQL-­‐IN-­‐HADOOP  

Page 86: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Pig  =  {PigLa/n  -­‐>  Tez  ||  MapReduce}  

Page 87: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

HCatalog  =  {metadata*  for  MapReduce,  Hive,  Pig,  HBase}  

*metadata  =  tables,  columns,  par//ons,  types  

Page 88: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Oozie  =  Job::{Task,  Task,  if  Task,  then  Task,  final  Task}  

Page 89: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Falcon  

Hadoop   Hadoop  

Data  Set  

ReplicaAon  

Data  Set  

Data  Set  

Data  Set  

Data  Set  

Data  Set  

Data  Set  

Late  Data  Arrival  

Lineage  

RetenAon  Policy  

Process  Management  Audit  

Monitoring  Archival  

Page 90: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Knox  

Hadoop  Cluster  

REST  Client   Knox  Gateway  

Hadoop  Cluster  

REST  Client  

REST  Client  

Enterprise  LDAP  

Page 91: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Flume  

JMS  

Weblogs  

Events  

Files  

Hadoop  Flume  

Flume  

Flume  

Flume  

Flume  

Flume  

Page 92: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Sqoop  

Hadoop  

DB  DB  Sqoop  

Sqoop  

Page 93: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Ambari  =  {install,  manage,  monitor}  

Page 94: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

HBase  =  {real-­‐/me,  distributed-­‐map,  big-­‐tables}  

Page 95: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Storm  =  {Complex  Event  Processing,  Near-­‐Real-­‐Time,  Provisioned  by  

YARN  }  

Page 96: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Apache  Hadoop  

Flume  Ambari  

HBase  Falcon  

MapReduce  HDFS  

Sqoop  HCatalog  

Pig  

Hive  

Storm  YARN  

Knox  

Tez  

Page 97: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Hortonworks  Data  PlaForm  

Flume  Ambari  

HBase  Falcon  

MapReduce  HDFS  

Sqoop  HCatalog  

Pig  

Hive  

Storm   YARN  

Knox  

Tez  

Page 98: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

What  else  are  we  working  on?  

hortonworks.com/labs/  

Page 99: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

Hadoop  is  the  new  Modern  Data  Architecture  for  the  Enterprise  

Page 100: 2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture

© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION Page  100  

There is NO second place

Hortonworks  …the  Bull  Elephant  of  Hadoop  InnovaCon