The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451...


THE BUSINESS ADVANTAGE OF HADOOP: LESSONS FROM THE FIELD


Matt Aslett, Research Manager, 451 Research

Mike Olson, CEO, Cloudera

Bill Theisinger, Executive Director, Platform Data Services, YP

Aaron Wiebe, BlackBerry Infrastructure Architect, Research In Motion

Introducing our Speakers

Aaron Wiebe, Bill Theisinger, Mike Olson, Matt Aslett


Big Data, Total Data… Hadoop

Matt Aslett (@maslett)
• Research Manager, Data Management and Analytics

Total Data
• Assesses data management approaches in an era of ‘big data’
• Explores the drivers behind new approaches to data management and analytics
• Explains the new and existing technologies used to store, process and deliver value from data


‘Big Data’

“Big data” describes the realization of greater business intelligence by storing, processing and analyzing data that was previously ignored due to the limitations of traditional data management technologies to handle its volume, velocity and/or variety.

Volume: the volume of data is too large for traditional database software tools to cope with.
Velocity: the data is being produced at a rate that is beyond the performance limits of traditional systems.
Variety: the data lacks the structure to make it suitable for storage and analysis in traditional databases and data warehouses.


‘Total Data’

The adoption of non-traditional data processing technologies is driven not just by the nature of the data, but also by the user’s particular data processing requirements.

Totality: the desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.
Exploration: the interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.
Frequency: the desire to increase the rate of analysis in order to generate more accurate and timely business intelligence.
Dependency: the reliance on existing technologies and skills, and the need to balance investment in those existing technologies and skills with the adoption of new techniques.


A virtuous circle?

Increased use of interactive applications and data-generating machines

New commercial opportunities for analyzing previously ignored data

Increased desire to store and process all available data

More economically feasible to store and process previously ignored data

New infrastructure investments to support new data processing software


What is Apache Hadoop?

Distributed data storage (HDFS) and processing (MapReduce), plus multiple associated data management projects.

• Open source
• Vendor-supported
• Clusters of commodity servers
• Storage of large data volumes
• Structured, unstructured and semi-structured data
• Flexible, schema-on-read processing
• Complex data sets
• Connectors to existing databases, data integration and business intelligence tools

Ecosystem projects include HDFS, MapReduce and Hadoop Common at the core, plus HBase, Hive, Pig, ZooKeeper, Flume, Sqoop, Mahout, Avro, Chukwa, Whirr and Hama.
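To make the HDFS-plus-MapReduce model concrete, here is a minimal sketch of the canonical word count job using Hadoop's newer Java MapReduce API. It is illustrative only, not code from the webinar: the mapper emits (word, 1) pairs from lines of input stored in HDFS, a combiner/reducer sums the counts, and the input and output paths are HDFS directories passed on the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel against HDFS blocks across the cluster,
  // emitting (word, 1) for every token in every input line.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: receives all counts for a given word and sums them.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this is typically launched with the hadoop jar command against input already sitting in HDFS.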


What is Apache Hadoop for?

Big-data storage: Hadoop as a platform for storing data that could not previously be efficiently stored.
Big-data integration: Hadoop as a large-scale data ingestion/ETL layer that complements existing databases.
Big-data analytics: Hadoop as a platform for new exploratory analytic applications.
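To illustrate the integration and exploratory roles, here is a hedged sketch that uses Hive's JDBC driver from Java to project a schema onto raw files already sitting in HDFS (schema-on-read) and then run an exploratory aggregate, which Hive compiles into jobs on the cluster. The HiveServer2 address, credentials, table name, columns and paths are placeholders, not details from the presentation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SchemaOnReadExample {
  public static void main(String[] args) throws Exception {
    // Hive JDBC driver (assumes the hive-jdbc jar is on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement()) {

      // Declare a schema over raw tab-separated log files that were loaded
      // into HDFS as-is; no upfront transformation of the data is required.
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS raw_clicks ("
          + " ts STRING, user_id STRING, url STRING)"
          + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
          + " LOCATION '/data/raw/clicks'");

      // Exploratory query over the full data set, not a sample.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT url, COUNT(*) FROM raw_clicks GROUP BY url")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}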

THE EVOLUTION OF HADOOP
And how it’s used in the real world today


Mike Olson

CEO & Co-Founder, Cloudera

Fastest sort of a TB: 62 seconds over 1,460 nodes
Sorted a PB in 16.25 hours over 3,658 nodes


CORE HADOOP COMPONENTS

Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers.
MapReduce: distributed computing across physical servers.

Apache Hadoop is a platform for data storage and processing that is scalable, fault tolerant and open source.

Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

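As a small illustration of the HDFS side, the hedged sketch below uses the Hadoop FileSystem Java API to copy a local file into the cluster and read it back. The NameNode URI and the paths are placeholders, not values from the presentation.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; on a real cluster this usually comes
    // from core-site.xml rather than being set in code.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; its blocks are replicated across
    // DataNodes (three copies by default), which is where the fault
    // tolerance described above comes from.
    fs.copyFromLocalFile(new Path("/tmp/events.log"), new Path("/data/raw/events.log"));

    // Read the file back; the client streams blocks from whichever
    // DataNodes hold them, so no single server owns the whole file.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/data/raw/events.log"))))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}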

2008: CLOUDERA FOUNDED BY MIKE OLSON, AMR AWADALLAH & JEFF HAMMERBACHER
2009: HADOOP CREATOR DOUG CUTTING JOINS CLOUDERA
2009: CDH: FIRST COMMERCIAL APACHE HADOOP DISTRIBUTION
2010: CLOUDERA MANAGER: FIRST MANAGEMENT APPLICATION FOR HADOOP
2011: CLOUDERA REACHES 100 PRODUCTION CUSTOMERS
2011: CLOUDERA UNIVERSITY EXPANDS TO 140 COUNTRIES
2012: CLOUDERA ENTERPRISE 4: THE STANDARD FOR HADOOP IN THE ENTERPRISE
2012: CLOUDERA CONNECT REACHES 300 PARTNERS
BEYOND… TRANSFORMING HOW COMPANIES THINK ABOUT DATA

CLOUDERA ENTERPRISE 4

CHANGING THE WORLD, ONE PETABYTE AT A TIME

CLOUDERA ENTERPRISE

CDH: BIG DATA STORAGE, PROCESSING & ANALYTICS PLATFORM BASED ON APACHE HADOOP – 100% OPEN SOURCE

CLOUDERA MANAGER: END-TO-END MANAGEMENT APPLICATION FOR THE DEPLOYMENT & OPERATION OF CDH

CLOUDERA SUPPORT: OUR TEAM OF EXPERTS ON CALL TO HELP YOU MEET YOUR SERVICE LEVEL AGREEMENTS (SLAs)

PROFESSIONAL SERVICES

USE CASE DISCOVERY

NEW HADOOP DEPLOYMENT

PROOF OF CONCEPT

PRODUCTION PILOTS

PROCESS & TEAM DEVELOPMENT

DEPLOYMENT CERTIFICATION

EDUCATION

DEVELOPERS

ADMINISTRATORS

CERTIFICATION PROGRAMS

DATA SCIENTISTS

Cloudera’s software is never installed all by itself

It’s always deployed alongside mission-critical systems that represent enormous investment

Extracting value from data requires sharing it across boundaries and among systems

Goal: The right storage and the right processing in the right place at the right time


• Disparate data sources
• Disparate systems for transforming, processing and analyzing data
• Disparate systems for capturing and reporting data, and for enforcing business and legislative governance requirements

All need to be connected for usability and to unlock the unique value of each.


[Platform diagram] Data sources: Logs, Files, Web Data, Relational Databases. Tools and systems: IDEs, BI / Analytics, Enterprise Reporting, Web Applications, Enterprise Data Warehouse, Operational Rules Engines, Management Tools. Users: Operators, Engineers, Analysts, Business Users. Cloudera Enterprise (CDH, Cloudera Manager, Technical Support), alongside Consulting Services and Cloudera University.

CUSTOMERS

INDUSTRY | DATA PROCESSING | ADVANCED ANALYTICS
Web | Clickstream Sessionization | Social Network Analysis
Media | Engagement | Content Optimization
Telecom | Mediation | Network Analytics
Retail | Data Factory | Loyalty & Promotions
Financial | Trade Reconciliation | Fraud Analysis
Government | Signal Intelligence (SIGINT) | Entity Analysis
Biotech / Pharma | Genome Mapping | Sequencing Analysis


Hadoop@YP

Sept 26, 2012
William Theisinger, Executive Director, Platform Computing

Challenges


• Increasing volume of traffic data through our distribution network

• Need for a system to support changing data complexity and detail

• Adhere to tighter SLAs

• Provide intra-day reporting

• Benefit from the intelligence trapped in our data

What we were facing


Legacy processing flow


[Flow] Application Log Data → ETL Processing → Data Load (multiple databases) → Data Warehouse (Data Layer)

• Reportable events were dropped on the floor
• Multiple databases had to be loaded
• Processing time was significant
• Reporting lag was in days, not hours
• High maintenance effort was required


Hadoop Platform


Hadoop processing flow


[Flow] Applications → LWES (Data Collection) → Hadoop Platform → Data Layer → Data Warehouse

• All ETL processing in Hadoop
• Several systems integrate with the Hadoop platform
• All Java MapReduce, with some Hive for end users and dependent systems (a minimal sketch of this pattern follows below)
• Reporting lag in hours, not days
• Actual reduction in maintenance needs

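As a hypothetical sketch of the Java MapReduce ETL pattern above, the mapper below parses tab-separated log events and emits one count per (listing, event type, day); a summing reducer like the one in the word count sketch earlier would produce the daily roll-ups consumed by Hive or the warehouse. The event format and field positions are invented for illustration and are not YP's actual schema.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical ETL mapper: parses tab-separated log events of the form
//   <ISO timestamp> <listing_id> <event_type> ...
// and emits ((listing, event type, day), 1).
public class ListingEventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t");
    if (fields.length < 3 || fields[0].length() < 10) {
      // Malformed events are skipped here; in practice they could be
      // tracked with a counter rather than dropped on the floor.
      return;
    }
    String day = fields[0].substring(0, 10); // yyyy-MM-dd
    outKey.set(fields[1] + "\t" + fields[2] + "\t" + day);
    context.write(outKey, ONE);
  }
}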

Next Generation


Hadoop processing flow


[Flow] Applications → LWES (Data Collection) → Hadoop Platform → Data Layer → Data Warehouse, plus an HBase Platform

• Migrating some reporting to HBase (a minimal client sketch follows after this list)
• Exposing core business KPIs via APIs
• Replacing various data marts with HBase tables/schemas
• Reducing TCO
• Aligning core skill sets

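A minimal sketch of the kind of HBase access this roadmap implies, using the classic Java client API: a daily KPI value is written to, and read back from, a hypothetical kpi_daily table keyed by listing and date. The table name, column family and key layout are assumptions, not YP's actual design.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class KpiStoreExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "kpi_daily");     // hypothetical table: row key = listingId#yyyy-MM-dd

    // Write one KPI cell: column family "m", qualifier "impressions".
    Put put = new Put(Bytes.toBytes("listing123#2012-09-26"));
    put.add(Bytes.toBytes("m"), Bytes.toBytes("impressions"), Bytes.toBytes(4200L));
    table.put(put);

    // Random read by row key: this is what lets an API serve a KPI lookup
    // in milliseconds instead of scanning files with a batch job.
    Get get = new Get(Bytes.toBytes("listing123#2012-09-26"));
    Result result = table.get(get);
    long impressions = Bytes.toLong(result.getValue(Bytes.toBytes("m"), Bytes.toBytes("impressions")));
    System.out.println("impressions on 2012-09-26: " + impressions);

    table.close();
  }
}

The row-key design (entity plus date) is what makes low-latency lookups possible without scanning the full data set.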

Hadoop @ Research In Motion
Aaron Wiebe, BlackBerry Infrastructure Architect


The Problem

1. BlackBerry Services currently generate 500TB of instrumentation data daily (and growing rapidly).

2. Traditional systems unable to cope with both growth and access requests.

3. Total global dataset of ~100PB.


The Old Way

• Focus on reducing data to the required data set
• Pipeline data flows to avoid hitting disk
• Scalability issues at most stages
• Going back to the archive was really time consuming


[Flow] Services → Filter and Split → Streaming ETL, feeding Event Monitoring and Alerting, Complex Correlation, the Data Warehouse and Archive Storage


The Hadoop Way

• Archive storage moved to HDFS
• ETL processes converted to Hadoop (Pig + Hive)
• Some data warehouse functions migrating to Hadoop


[Flow] Services → Filter and Split → Event Monitoring and Alerting, and → Hadoop (Archive Storage, ETL, Correlation, Stage 1 DWH) → Data Warehouse


Real Results

• 90% code base reduction for ETL tools
• Example performance: a previous ad-hoc query took around 4 days; it now takes 53 minutes
• Significant capital cost reductions over the previous system


Introducing our Speakers

Aaron Wiebe, Bill Theisinger, Mike Olson, Matt Aslett
