The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451...


THE BUSINESS ADVANTAGE OF HADOOP: LESSONS FROM THE FIELD


Matt Aslett, Research Manager, 451 Research

Mike Olson, CEO, Cloudera

Bill Theisinger, Executive Director, Platform Data Services, YP

Aaron Wiebe, BlackBerry Infrastructure Architect, Research In Motion

Introducing our Speakers

Aaron Wiebe, Bill Theisinger, Mike Olson, Matt Aslett


Big Data, Total Data… Hadoop

Matt Aslett (@maslett)
• Research Manager, Data Management and Analytics

Total Data
• Assesses data management approaches in an era of ‘big data’
• Explores the drivers behind new approaches to data management and analytics
• Explains the new and existing technologies used to store, process and deliver value from data


‘Big Data’

“Big data” describes the realization of greater business intelligence by storing, processing and analyzing data that was previously ignored due to the limitations of traditional data management technologies to handle its volume, velocity and/or variety.

Volume: the volume of data is too large for traditional database software tools to cope with.
Velocity: the data is being produced at a rate that is beyond the performance limits of traditional systems.
Variety: the data lacks the structure to make it suitable for storage and analysis in traditional databases and data warehouses.


‘Total Data’

The adoption of non-traditional data processing technologies is driven not just by the nature of the data, but also by the user’s particular data processing requirements.

Totality: the desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.
Exploration: the interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.
Frequency: the desire to increase the rate of analysis in order to generate more accurate and timely business intelligence.
Dependency: the reliance on existing technologies and skills, and the need to balance investment in those existing technologies and skills with the adoption of new techniques.


A virtuous circle?

Increased use of interactive applications and data-generating machines

New commercial opportunities for analyzing previously ignored data

Increased desire to store and process all available data

More economically feasible to store and process previously ignored data

New infrastructure investments to support new data processing software


What is Apache Hadoop?

Distributed data storage (HDFS) and processing (MapReduce), plus multiple associated data management projects.

• Open source
• Vendor-supported
• Clusters of commodity servers
• Storage of large data volumes
• Structured, unstructured and semi-structured data
• Flexible, schema-on-read processing
• Complex data sets
• Connectors to existing databases, data integration and business intelligence tools

Ecosystem projects include HDFS, MapReduce and Hadoop Common at the core, plus HBase, Hive, Pig, ZooKeeper, Flume, Sqoop, Mahout, Avro, Chukwa, Whirr and Hama.
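To make the HDFS-plus-MapReduce model concrete, here is a minimal sketch of the canonical word count job using Hadoop's newer Java MapReduce API. It is illustrative only, not code from the webinar: the mapper emits (word, 1) pairs from lines of input stored in HDFS, a combiner/reducer sums the counts, and the input and output paths are HDFS directories passed on the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel against HDFS blocks across the cluster,
  // emitting (word, 1) for every token in every input line.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: receives all counts for a given word and sums them.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this is typically launched with the hadoop jar command against input already sitting in HDFS.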


What is Apache Hadoop for?

Big-data storage: Hadoop as a platform for storing data that could not previously be efficiently stored.
Big-data integration: Hadoop as a large-scale data ingestion/ETL layer that complements existing databases.
Big-data analytics: Hadoop as a platform for new exploratory analytic applications.
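To illustrate the integration and exploratory roles, here is a hedged sketch that uses Hive's JDBC driver from Java to project a schema onto raw files already sitting in HDFS (schema-on-read) and then run an exploratory aggregate, which Hive compiles into jobs on the cluster. The HiveServer2 address, credentials, table name, columns and paths are placeholders, not details from the presentation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SchemaOnReadExample {
  public static void main(String[] args) throws Exception {
    // Hive JDBC driver (assumes the hive-jdbc jar is on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement()) {

      // Declare a schema over raw tab-separated log files that were loaded
      // into HDFS as-is; no upfront transformation of the data is required.
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS raw_clicks ("
          + " ts STRING, user_id STRING, url STRING)"
          + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
          + " LOCATION '/data/raw/clicks'");

      // Exploratory query over the full data set, not a sample.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT url, COUNT(*) FROM raw_clicks GROUP BY url")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}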

THE EVOLUTION OF HADOOP
And how it’s used in the real world today


Mike Olson

CEO & Co-Founder, Cloudera

Fastest sort of a TB: 62 seconds over 1,460 nodes
Sorted a PB in 16.25 hours over 3,658 nodes


CORE HADOOP COMPONENTS

Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers.
MapReduce: distributed computing across physical servers.

Apache Hadoop is a platform for data storage and processing that is scalable, fault tolerant and open source.

Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

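As a small illustration of the HDFS side, the hedged sketch below uses the Hadoop FileSystem Java API to copy a local file into the cluster and read it back. The NameNode URI and the paths are placeholders, not values from the presentation.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; on a real cluster this usually comes
    // from core-site.xml rather than being set in code.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; its blocks are replicated across
    // DataNodes (three copies by default), which is where the fault
    // tolerance described above comes from.
    fs.copyFromLocalFile(new Path("/tmp/events.log"), new Path("/data/raw/events.log"));

    // Read the file back; the client streams blocks from whichever
    // DataNodes hold them, so no single server owns the whole file.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/data/raw/events.log"))))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}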

2008: CLOUDERA FOUNDED BY MIKE OLSON, AMR AWADALLAH & JEFF HAMMERBACHER
2009: HADOOP CREATOR DOUG CUTTING JOINS CLOUDERA
2009: CDH: FIRST COMMERCIAL APACHE HADOOP DISTRIBUTION
2010: CLOUDERA MANAGER: FIRST MANAGEMENT APPLICATION FOR HADOOP
2011: CLOUDERA REACHES 100 PRODUCTION CUSTOMERS
2011: CLOUDERA UNIVERSITY EXPANDS TO 140 COUNTRIES
2012: CLOUDERA ENTERPRISE 4: THE STANDARD FOR HADOOP IN THE ENTERPRISE
2012: CLOUDERA CONNECT REACHES 300 PARTNERS
BEYOND… TRANSFORMING HOW COMPANIES THINK ABOUT DATA

CLOUDERA ENTERPRISE 4

CHANGING THE WORLD, ONE PETABYTE AT A TIME

CLOUDERA ENTERPRISE

CDH: BIG DATA STORAGE, PROCESSING & ANALYTICS PLATFORM BASED ON APACHE HADOOP – 100% OPEN SOURCE

CLOUDERA MANAGER: END-TO-END MANAGEMENT APPLICATION FOR THE DEPLOYMENT & OPERATION OF CDH

CLOUDERA SUPPORT: OUR TEAM OF EXPERTS ON CALL TO HELP YOU MEET YOUR SERVICE LEVEL AGREEMENTS (SLAs)

PROFESSIONAL SERVICES

USE CASE DISCOVERY

NEW HADOOP DEPLOYMENT

PROOF OF CONCEPT

PRODUCTION PILOTS

PROCESS & TEAM DEVELOPMENT

DEPLOYMENT CERTIFICATION

EDUCATION

DEVELOPERS

ADMINISTRATORS

CERTIFICATION PROGRAMS

DATA SCIENTISTS

Cloudera’s software is never installed all by itself

It’s always deployed alongside mission-critical systems that represent enormous investment

Extracting value from data requires sharing it across boundaries and among systems

Goal: The right storage and the right processing in the right place at the right time


• Disparate data sources
• Disparate systems for transforming, processing and analyzing data
• Disparate systems for capturing and reporting data, and for enforcing business and legislative governance requirements

All need to be connected for usability and to unlock the unique value of each.


[Platform diagram] Data sources: Logs, Files, Web Data, Relational Databases. Tools and systems: IDEs, BI / Analytics, Enterprise Reporting, Web Applications, Enterprise Data Warehouse, Operational Rules Engines, Management Tools. Users: Operators, Engineers, Analysts, Business Users. Cloudera Enterprise (CDH, Cloudera Manager, Technical Support), alongside Consulting Services and Cloudera University.

CUSTOMERS

INDUSTRY | DATA PROCESSING | ADVANCED ANALYTICS
Web | Clickstream Sessionization | Social Network Analysis
Media | Engagement | Content Optimization
Telecom | Mediation | Network Analytics
Retail | Data Factory | Loyalty & Promotions
Financial | Trade Reconciliation | Fraud Analysis
Government | Signal Intelligence (SIGINT) | Entity Analysis
Biotech / Pharma | Genome Mapping | Sequencing Analysis


Hadoop@YP

Sept 26, 2012
William Theisinger, Executive Director, Platform Computing

Challenges


• Increasing volume of traffic data through our distribution network

• Need for a system to support changing data complexity and detail

• Adhere to tighter SLAs

• Provide intra-day reporting

• Benefit from the intelligence trapped in our data

What we were facing


Legacy processing flow


[Flow] Application Log Data → ETL Processing → Data Load (multiple databases) → Data Warehouse (Data Layer)

• Reportable events were dropped on the floor
• Multiple databases had to be loaded
• Processing time was significant
• Reporting lag was in days, not hours
• High maintenance effort was required


Hadoop Platform


Hadoop processing flow


[Flow] Applications → LWES (Data Collection) → Hadoop Platform → Data Layer → Data Warehouse

• All ETL processing in Hadoop
• Several systems integrate with the Hadoop platform
• All Java MapReduce, with some Hive for end users and dependent systems (a minimal sketch of this pattern follows below)
• Reporting lag in hours, not days
• Actual reduction in maintenance needs

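As a hypothetical sketch of the Java MapReduce ETL pattern above, the mapper below parses tab-separated log events and emits one count per (listing, event type, day); a summing reducer like the one in the word count sketch earlier would produce the daily roll-ups consumed by Hive or the warehouse. The event format and field positions are invented for illustration and are not YP's actual schema.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical ETL mapper: parses tab-separated log events of the form
//   <ISO timestamp> <listing_id> <event_type> ...
// and emits ((listing, event type, day), 1).
public class ListingEventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t");
    if (fields.length < 3 || fields[0].length() < 10) {
      // Malformed events are skipped here; in practice they could be
      // tracked with a counter rather than dropped on the floor.
      return;
    }
    String day = fields[0].substring(0, 10); // yyyy-MM-dd
    outKey.set(fields[1] + "\t" + fields[2] + "\t" + day);
    context.write(outKey, ONE);
  }
}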

Next Generation


Hadoop processing flow


[Flow] Applications → LWES (Data Collection) → Hadoop Platform → Data Layer → Data Warehouse, plus an HBase Platform

• Migrating some reporting to HBase (a minimal client sketch follows after this list)
• Exposing core business KPIs via APIs
• Replacing various data marts with HBase tables/schemas
• Reducing TCO
• Aligning core skill sets

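A minimal sketch of the kind of HBase access this roadmap implies, using the classic Java client API: a daily KPI value is written to, and read back from, a hypothetical kpi_daily table keyed by listing and date. The table name, column family and key layout are assumptions, not YP's actual design.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class KpiStoreExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "kpi_daily");     // hypothetical table: row key = listingId#yyyy-MM-dd

    // Write one KPI cell: column family "m", qualifier "impressions".
    Put put = new Put(Bytes.toBytes("listing123#2012-09-26"));
    put.add(Bytes.toBytes("m"), Bytes.toBytes("impressions"), Bytes.toBytes(4200L));
    table.put(put);

    // Random read by row key: this is what lets an API serve a KPI lookup
    // in milliseconds instead of scanning files with a batch job.
    Get get = new Get(Bytes.toBytes("listing123#2012-09-26"));
    Result result = table.get(get);
    long impressions = Bytes.toLong(result.getValue(Bytes.toBytes("m"), Bytes.toBytes("impressions")));
    System.out.println("impressions on 2012-09-26: " + impressions);

    table.close();
  }
}

The row-key design (entity plus date) is what makes low-latency lookups possible without scanning the full data set.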

Hadoop @ Research In Motion
Aaron Wiebe, BlackBerry Infrastructure Architect


The Problem

1. BlackBerry Services currently generate 500TB of instrumentation data daily (and growing rapidly).

2. Traditional systems unable to cope with both growth and access requests.

3. Total global dataset of ~100PB.


The Old Way

• Focus on reducing data to the required data set
• Pipeline data flows to avoid hitting disk
• Scalability issues at most stages
• Going back to the archive was really time consuming


[Flow] Services → Filter and Split → Streaming ETL, feeding Event Monitoring and Alerting, Complex Correlation, the Data Warehouse and Archive Storage


The Hadoop Way

• Archive storage moved to HDFS
• ETL processes converted to Hadoop (Pig + Hive)
• Some data warehouse functions migrating to Hadoop


[Flow] Services → Filter and Split → Event Monitoring and Alerting, and → Hadoop (Archive Storage, ETL, Correlation, Stage 1 DWH) → Data Warehouse


Real Results

• 90% code base reduction for ETL tools
• Example performance: a previous ad-hoc query took around 4 days; it now takes 53 minutes
• Significant capital cost reductions over the previous system


Introducing our Speakers

Aaron Wiebe, Bill Theisinger, Mike Olson, Matt Aslett
