Eliminating the Challenges of Big Data Management Inside Hadoop


2 © RedPoint Global Inc. 2015 Confidential

Today’s Speakers

Justin Sears, Senior Manager, Product Marketing, Hortonworks

Jamie Keeffe, Product Marketing Manager, RedPoint Global

Kris Tomes, Solutions Director, RedPoint Global

 

Page 3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved

Hortonworks: Hadoop for the Enterprise We Do Hadoop

Spring 2015

Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Customer Momentum
•  330+ customers (as of year-end 2014)

Hortonworks Data Platform
•  Completely open multi-tenant platform for any app & any data.
•  A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.

Partner for Customer Success
•  Open source community leadership focus on enterprise needs
•  Unrivaled world class support

•  Founded in 2011
•  Original 24 architects, developers, operators of Hadoop from Yahoo!
•  600+ Employees
•  1000+ Ecosystem Partners

Hortonworks. We do Hadoop.

Traditional systems under pressure

Challenges
•  Constrains data to app
•  Can’t manage new data
•  Costly to Scale

[Figure: business value of new versus traditional data, industry leaders versus laggards. New data: clickstream, geolocation, web data, Internet of Things, docs and emails, server logs. Traditional data: ERP, CRM, SCM. Data volume: 2.8 zettabytes (2012), 40 zettabytes (2020).]

Hadoop emerged as foundation of new data architecture

Apache Hadoop is an open source data platform for managing large volumes of high velocity and variety of data
•  Built by Yahoo! to be the heartbeat of its ad & search business
•  Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises
•  Incredibly disruptive to current platform economics

Traditional Hadoop Advantages
✓  Manages new data paradigm
✓  Handles data at scale
✓  Cost effective
✓  Open source

Traditional Hadoop Had Limitations
•  Batch-only architecture
•  Single purpose clusters, specific data sets
•  Difficult to integrate with existing investments
•  Not enterprise-grade

[Diagram: Application on top of Batch Processing (MapReduce) on top of Storage (HDFS)]

Modern Data Architecture emerges to unify data & processing

Modern Data Architecture
•  Enable applications to have access to all your enterprise data through an efficient centralized platform
•  Supported with a centralized approach to governance, security and operations
•  Versatile to handle any applications and datasets no matter the size or type

[Diagram: Sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured; existing systems: ERP, CRM, SCM) feed HDFS (Hadoop Distributed File System) and YARN: Data Operating System, which run batch, interactive, real-time, and partner ISV workloads alongside MPP and EDW systems. Analytics: data marts, business analytics, visualization & dashboards, applications.]


RedPoint Global is a Hortonworks Partner, certified on HDP and YARN. With RedPoint, your existing data analysts and database administrators can easily work with data stored in Hadoop. No new skills are required.

Hadoop adoption follows a predictable journey: cost optimization, new analytic apps, and ultimately a “data lake”

Hadoop Driver: Cost optimization

HDP helps you reduce costs and optimize the value associated with your EDW

•  Archive Data off EDW: Move rarely used data to Hadoop as active archive, store more data longer
•  Offload costly ETL process: Free your EDW to perform high-value functions like analytics & operations, not ETL
•  Enrich the value of your EDW: Use Hadoop to refine new data sources, such as web and machine data for new analytical context

[Diagram: Sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured; existing systems: ERP, CRM, SCM) load into HDP 2.2, which holds cold data, deeper archive & new sources and performs ELT. Hot data stays in the Enterprise Data Warehouse (MPP, in-memory). Analytics: data marts, business analytics, visualization & dashboards.]

Hadoop Driver: Advanced analytic applications

Application types
•  Single View: Improve acquisition and retention
•  Predictive Analytics: Identify your next best action
•  Data Discovery: Uncover new findings

Financial Services
•  New Account Risk Screens; Trading Risk; Insurance Underwriting
•  Improved Customer Service; Insurance Underwriting; Aggregate Banking Data as a Service
•  Cross-sell & Upsell of Financial Products; Risk Analysis for Usage-Based Car Insurance; Identify Claims Errors for Reimbursement

Telecom
•  Unified Household View of the Customer; Searchable Data for NPTB Recommendations; Protect Customer Data from Employee Misuse
•  Analyze Call Center Contacts Records; Network Infrastructure Capacity Planning; Call Detail Records (CDR) Analysis
•  Inferred Demographics for Improved Targeting; Proactive Maintenance on Transmission Equipment; Tiered Service for High-Value Customers

Retail
•  360° View of the Customer; Supply Chain Optimization; Website Optimization for Path to Purchase
•  Localized, Personalized Promotions; A/B Testing for Online Advertisements; Data-Driven Pricing, improved loyalty programs
•  Customer Segmentation; Personalized, Real-time Offers; In-Store Shopper Behavior

Manufacturing
•  Supply Chain and Logistics; Optimize Warehouse Inventory Levels; Product Insight from Electronic Usage Data
•  Assembly Line Quality Assurance; Proactive Equipment Maintenance; Crowdsource Quality Assurance
•  Single View of a Product Throughout Lifecycle; Connected Car Data for Ongoing Innovation; Improve Manufacturing Yields

Healthcare
•  Electronic Medical Records; Monitor Patient Vitals in Real-Time; Use Genomic Data in Medical Trials
•  Improving Lifelong Care for Epilepsy; Rapid Stroke Detection and Intervention; Monitor Medical Supply Chain to Reduce Waste
•  Reduce Patient Re-Admittance Rates; Video Analysis for Surgical Decision Support; Healthcare Analytics as a Service

Oil & Gas
•  Unify Exploration & Production Data; Monitor Rig Safety in Real-Time; Geographic exploration
•  DCA to Slow Well Declines Curves; Proactive Maintenance for Oil Field Equipment; Define Operational Set Points for Wells

Government
•  Single View of Entity; CBM & Autonomic Logistic Analysis; Sentiment Analysis on Program Effectiveness
•  Prevent Fraud, Waste and Abuse; Proactive Maintenance for Public Infrastructure; Meet Deadlines for Government Reporting

Hadoop Driver: Enabling the data lake

Data Lake Definition
•  Centralized Architecture: Multiple applications on a shared data set with consistent levels of service
•  Any App, Any Data: Multiple applications accessing all data affording new insights and opportunities.
•  Unlocks ‘Systems of Insight’: Advanced algorithms and applications used to derive new value and optimize existing value.

Drivers:
1.  Cost Optimization
2.  Advanced Analytic Apps

Goal:
•  Centralized Architecture
•  Data-driven Business

[Chart: Journey to the Data Lake with Hadoop. Axes: scale and scope. The journey leads to the data lake and systems of insight.]

Case Study: 12 month Hadoop evolution at TrueCar

12 months execution plan (data platform capabilities over time)
•  June 2013: Begin Hadoop Execution
•  July 2013: Hortonworks Partnership
•  Aug 2013: Training & Dev Begins
•  Nov 2013: Production Cluster, 60 Nodes, 2 PB
•  Dec 2013: Three Production Apps (3 total)
•  Jan 2014: 40% Dev Staff Perficient
•  Feb 2014: Three More Production Apps (6 total)
•  May ‘14: IPO

12 Month Results at TrueCar
•  Six Production Hadoop Applications
•  Sixty nodes/2PB data
•  Storage Costs/Compute Costs from $19/GB to $0.23/GB

“We addressed our data platform capabilities strategically as a pre-cursor to IPO.”
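As a rough check on the per-gigabyte figures quoted for TrueCar ($19/GB down to $0.23/GB), the implied reduction can be worked out directly. This is plain arithmetic on the two numbers from the slide, not an additional result reported by TrueCar or Hortonworks:

```latex
\[
\frac{\$19/\text{GB}}{\$0.23/\text{GB}} \approx 82.6
\qquad\text{and}\qquad
1 - \frac{0.23}{19} \approx 0.988 = 98.8\%
\]
```

In other words, the quoted figures imply a cost per gigabyte roughly 83 times lower, a reduction of about 98.8%.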


Hortonworks Data Platform Hadoop for the Enterprise

Only HDP delivers a Centralized Architecture

HDP is uniquely built around YARN serving as a data operating system that provides multi-tenant Resource Management, consistent Governance & Security and efficient Operations services across Hadoop applications.

YARN Data Operating System
•  A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.
•  The versatility to support multiple applications and diverse workloads from batch to interactive to real-time, open source and commercial.

Key Benefits
•  Multiple applications on a shared data set with consistent levels of service: a multitenant data platform.
•  Provides a shared platform to enable new analytic applications.
•  Delivers maximum cost efficiency for cluster resource management. Fewer servers, fewer nodes.

[Diagram: Hortonworks Data Platform. Existing applications, new analytics, and partner applications sit on data access (batch, interactive & real-time), which runs on YARN: Data Operating System and storage, with governance, security, operations, and resource management services.]

HDP delivers a completely open data platform

Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data.

Completely Open
•  HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations
•  All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem.

[Diagram: Hortonworks Data Platform 2.2
  Governance: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka
  Batch, interactive & real-time data access: Apache Pig, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm
  Security: Apache Ranger, Apache Knox, Apache Falcon
  Operations: Apache Ambari, Apache Zookeeper, Apache Oozie
  Foundation: YARN: Data Operating System (Cluster Resource Management) on HDFS (Hadoop Distributed File System)]

HDP: Any Data, Any Application, Anywhere

Any Application
•  Deep integration with ecosystem partners to extend existing investments and skills
•  Broadest set of applications through the stable of YARN-Ready applications

Over 70 Hortonworks Certified YARN Apps

Any Data
Deploy applications fueled by clickstream, sensor, social, mobile, geo-location, server log, and other new paradigm datasets with existing legacy datasets.
Sources shown: clickstream, web & social, geolocation, Internet of Things, server logs, files and emails, ERP, CRM, SCM.

Anywhere
Implement HDP naturally across the complete range of deployment options: commodity, appliance, cloud, hybrid.

Hortonworks supports the full application lifecycle

Hadoop usage follows a consistent lifecycle: Architecture & Development, Implementation, Production, Expansion. From architecture to expansion, all with a consistent support experience.

Most Common Support Issues by Project Phase
Issues addressed by Hortonworks Support, by type, for the past year

Issue Type                  Share
Architecture                  7%
Application Development      10%
Installation                 10%
Performance                   5%
Configuration                25%
Executing Jobs               20%
Cluster Administration       18%
HDP Upgrades                  3%
Enhancement Requests          3%
TOTAL                       100%

Hortonworks Support: Full Lifecycle Subscription Support
Support through EVERY phase of adoption of your Hadoop project to ensure your success

[Chart: # tickets across successive projects (Project 2, Project 3, ..., Project N)]


“Hortonworks loves and lives open source innovation” World Class Support and Services. Hortonworks' Customer Support received a maximum score and was significantly higher than both Cloudera and MapR

A Leader in Hadoop

The Forrester Wave™ Big Data Hadoop Solutions Q1 2014


Overview of RedPoint Global

•  Launched 2006
•  Founded and staffed by industry veterans
•  Headquarters: Wellesley, Massachusetts
•  Offices in US, UK, Australia, Philippines
•  Global customer base
•  Serves most major industries

[Figures: Magic Quadrant for Data Quality; Magic Quadrant for Multichannel Campaign Management; Magic Quadrant for Integrated Marketing Management]


Andrew Brust, GigaOm Research


New Data Straining Current Architectures

Data types: unstructured documents and emails; transactional data; server logs; sentiment and web data; geolocation; sensor and machine data; clickstream; hierarchical data; OLTP, ERP, CRM; master data

•  2.8 ZB in 2013
•  85% from new data types
•  15x Machine Data by 2020
•  40 ZB by 2020

Source: IDC


Key Functions for Data Management

ETL & ELT
•  Profiling, reads/writes, transformations
•  Single project for all jobs

Data Quality
•  Cleanse data
•  Parsing, correction
•  Geo-spatial analysis

Integration & Matching
•  Grouping
•  Fuzzy match

Master Key Management
•  Create keys
•  Track changes
•  Maintain matches over time

Web Services Integration
•  Consume and publish
•  HTTP/HTTPS protocols
•  XML/JSON/SOAP formats

Process Automation & Operations
•  Job scheduling, monitoring, notifications
•  Central point of control
•  Meta Data Management


Overview - What is Hadoop?

Hadoop 1.0
•  All operations based on MapReduce
•  Intrinsic inconsistency of code-based solutions
•  Highly skilled and expensive resources needed
•  3rd party applications constrained by the need to generate code

Hadoop 2.0
•  Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
•  Mature applications can now operate directly on Hadoop
•  Reduced skill requirements and increased consistency

[Diagram: Hadoop 2.0 stack
  API: Scripting (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Spark, other ISVs, Direct (Java, .NET)
  Engine: Batch (MapReduce), Batch & Interactive (Tez), Real-Time (Slider)
  System: YARN: Data Operating System on HDFS (Hadoop Distributed File System)]


RedPoint Data Management on Hadoop

[Diagram: components shown are Partitioning AM / Tasks, Execution AM / Tasks, Data I/O, Key / Split Analysis, and Parallel Section, running on YARN alongside MapReduce.]

RedPoint DM for Hadoop: Processing Flow

[Diagram: the Data Management Designer and DM Execution Server submit a parallel section to the cluster. The Resource Manager launches the DM App Master on a Node Manager; the DM App Master launches tasks; DM Tasks run on the Node Managers. Numbered step labels (1 to 3): Launches DM App Master; Launches Tasks; Running DM Task.]

Benchmarks – Project Gutenberg

MapReduce
•  >150 lines of MR code
•  6 hours of development
•  6 minutes runtime
•  Extensive optimization needed

Pig
•  ~50 lines of script code
•  3 hours of development
•  15 minutes runtime
•  User Defined Functions required prior to running script

RedPoint
•  0 lines of code
•  15 min. of development
•  3 minutes runtime
•  No tuning or optimization required

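Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

    // Imports needed by this excerpt
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // WordOffset is not a standard Hadoop class; it is presumably defined
    // elsewhere in the full benchmark code, which this excerpt omits.
    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
        // Characters treated as word separators
        private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every token found on the input line
        public void map(WordOffset key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }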

Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';
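For readers who want to see what the benchmark computes without a Hadoop cluster, the word count expressed by the two samples can be sketched in plain, single-process Java. This is an illustrative sketch, not the benchmark code: the class name `WordCountSketch`, the method `countWords`, the shortened ASCII-only delimiter set, and the sample input are all assumptions made here. It tokenizes on delimiters as the MapReduce sample does, uppercases each word as the Pig script does, and orders by descending count (ties are broken alphabetically, a choice made only to keep the output deterministic).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

final class WordCountSketch {
    // Assumption: an abbreviated, ASCII-only version of the slide's delimiter set.
    private static final String DELIMITERS = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|\t\r\n";

    // Counts uppercased words across all lines and returns them sorted by
    // descending count, then alphabetically.
    static List<Map.Entry<String, Long>> countWords(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            // "Map" step: split the line into tokens
            StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken().toUpperCase(Locale.ROOT);
                // "Reduce" step: sum the occurrences per word
                counts.merge(word, 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the Project Gutenberg texts
        List<String> lines = List.of("The quick brown fox.", "The lazy dog; the end!");
        for (Map.Entry<String, Long> entry : countWords(lines)) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

The sketch holds every count in one in-memory map, which is exactly what stops working at cluster scale; the MapReduce and Pig versions distribute the same tokenize, group, and count steps across nodes.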

RedPoint Ranks #1

Consistent High Rankings

Data Lake Architecture for MDM

Kris Tomes, Solutions Director at RedPoint

Demonstration Introduction

Key Factors to Consider

•  Traditional data architectures are challenged
•  Maximize the scale & cost optimization of the Hortonworks Modern Data Architecture
•  Leverage your DBAs to control development / implementation / production costs and schedules
•  Smooth out your journey to a data lake
•  Expedite the speed for getting business applications into production
•  Insist on Any Data, Any Application, Any Environment
•  Do your data quality and data integration in the cluster

Thank You & Please Visit Us at www.RedPoint.net

Jamie Keeffe, Product Marketing Manager, RedPoint Global Inc.
[email protected]
+1 978-764-3839