THE BUSINESS ADVANTAGE OF HADOOP: LESSONS FROM THE FIELD
Matt Aslett, Research Manager, 451 Research
Mike Olson, CEO, Cloudera
Bill Theisinger, Executive Director, Platform Data Services, YP
Aaron Wiebe, BlackBerry Infrastructure Architect, Research In Motion
Introducing our Speakers
Aaron Wiebe
Bill Theisinger
Mike Olson
Matt Aslett
© 2012 by The 451 Group. All rights reserved
Big Data, Total Data… Hadoop
Matt Aslett (@maslett)
• Research manager, data management and analytics

Total Data
• Assesses data management approaches in an era of 'big data'
• Explores the drivers behind new approaches to data management and analytics
• Explains the new and existing technologies used to store, process and deliver value from data
'Big Data'

"Big data" describes the realization of greater business intelligence by storing, processing and analyzing data that was previously ignored due to the limitations of traditional data management technologies to handle its volume, velocity and/or variety.

• Volume: The volume of data is too large for traditional database software tools to cope with.
• Velocity: The data is being produced at a rate that is beyond the performance limits of traditional systems.
• Variety: The data lacks the structure to make it suitable for storage and analysis in traditional databases and data warehouses.
'Total Data'

The adoption of non-traditional data processing technologies is driven not just by the nature of the data, but also by the user's particular data processing requirements.

• Totality: The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.
• Exploration: The interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.
• Frequency: The desire to increase the rate of analysis in order to generate more accurate and timely business intelligence.
• Dependency: The reliance on existing technologies and skills, and the need to balance investment in those existing technologies and skills with the adoption of new techniques.
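The 'Exploration' driver is what Hadoop users call schema-on-read: structure is applied when the data is queried, not when it is loaded. A minimal single-machine sketch in plain Python (the event fields and records here are hypothetical, chosen only to illustrate the idea):

```python
import json

# Raw events are stored as-is; no schema is imposed at write time.
raw_events = [
    '{"user": "a", "action": "click", "ms": 120}',
    '{"user": "b", "action": "view"}',
    '{"user": "a", "action": "click", "ms": 95}',
]

def query(lines, projection):
    """Apply a schema (a field projection) at read time: each query
    decides which fields exist and how to treat missing ones."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(field) for field in projection)

# Two different "schemas" over the same raw data, defined per query.
clicks = [r for r in query(raw_events, ("user", "ms")) if r[1] is not None]
actions = [r for r in query(raw_events, ("action",))]
```

Because the schema lives in the query rather than in the storage layer, a new analytic question needs no reload or migration of the underlying data.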
A virtuous circle?
• Increased use of interactive applications and data-generating machines
• New commercial opportunities for analyzing previously ignored data
• Increased desire to store and process all available data
• More economically feasible to store and process previously ignored data
• New infrastructure investments to support new data processing software
What is Apache Hadoop?

Distributed data storage (HDFS) and processing (MapReduce), plus multiple associated data management projects: HBase, ZooKeeper, Pig, Flume, Mahout, Avro, Chukwa, Sqoop, Whirr, Hive, Hama and Hadoop Common.

• Open source
• Vendor-supported
• Clusters of commodity servers
• Storage of large data volumes
• Structured, unstructured and semi-structured data
• Flexible, schema-on-read processing
• Complex data sets
• Connectors to existing databases, data integration and business intelligence tools
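MapReduce itself runs across clusters of servers, but the programming model is small enough to sketch in a few lines. This single-process Python sketch shows the three phases (map, shuffle, reduce) using word count, the canonical example, which is not taken from the slides:

```python
from collections import defaultdict

def map_phase(docs):
    # Mapper: emit a (key, value) pair per word in each document.
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key. On a real cluster the framework
    # does this between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
```

On Hadoop, the same mapper and reducer logic is distributed over HDFS blocks, which is what lets the model scale from this toy example to petabytes.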
What is Apache Hadoop for?

• Big-data storage: a platform for storing data that could not previously be efficiently stored.
• Big-data integration: a large-scale data ingestion/ETL layer that complements existing databases.
• Big-data analytics: a platform for new exploratory analytic applications.
THE EVOLUTION OF HADOOP: And how it's used in the real world today
Mike Olson
CEO & Co-Founder, Cloudera
Fastest sort of a TB: 62 seconds over 1,460 nodes
Sorted a PB in 16.25 hours over 3,658 nodes
©2011 Cloudera, Inc. All Rights Reserved.
CORE HADOOP COMPONENTS

Apache Hadoop is a platform for data storage and processing that is scalable, fault tolerant and open source.

Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers.
MapReduce: distributed computing across physical servers.

Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
2008: Cloudera founded by Mike Olson, Amr Awadallah & Jeff Hammerbacher
2009: Hadoop creator Doug Cutting joins Cloudera
2009: CDH, the first commercial Apache Hadoop distribution
2010: Cloudera Manager, the first management application for Hadoop
2011: Cloudera reaches 100 production customers
2011: Cloudera University expands to 140 countries
2012: Cloudera Enterprise 4, the standard for Hadoop in the enterprise
2012: Cloudera Connect reaches 300 partners
Beyond: transforming how companies think about data

CLOUDERA ENTERPRISE 4: CHANGING THE WORLD, ONE PETABYTE AT A TIME
CLOUDERA ENTERPRISE
• CDH: big data storage, processing & analytics platform based on Apache Hadoop, 100% open source
• Cloudera Manager: end-to-end management application for the deployment & operation of CDH
• Cloudera Support: our team of experts on call to help you meet your service level agreements (SLAs)

PROFESSIONAL SERVICES
• Use case discovery
• New Hadoop deployment
• Proof of concept
• Production pilots
• Process & team development
• Deployment certification

EDUCATION
• Developers
• Administrators
• Certification programs
• Data scientists
Cloudera’s software is never installed all by itself
It’s always deployed alongside mission-critical systems that represent enormous investment
Extracting value from data requires sharing it across boundaries and among systems
Goal: The right storage and the right processing in the right place at the right time
✛ Disparate data sources
✛ Disparate systems for transforming, processing and analyzing data
✛ Disparate systems for capturing and reporting data, and for enforcing business and legislative governance requirements

All need to be connected for usability and to unlock the unique value of each.
[Platform diagram: data sources (logs, files, web data, relational databases) and web applications feed Cloudera Enterprise (CDH, Cloudera Manager, technical support), backed by consulting services and Cloudera University. Operators and engineers work through management tools and IDEs; analysts and business users connect via BI/analytics and enterprise reporting tools, the enterprise data warehouse and operational rules engines.]
CUSTOMERS
INDUSTRY          DATA PROCESSING               ADVANCED ANALYTICS
Web               Clickstream Sessionization    Social Network Analysis
Media             Engagement                    Content Optimization
Telecom           Mediation                     Network Analytics
Retail            Data Factory                  Loyalty & Promotions
Financial         Trade Reconciliation          Fraud Analysis
Government        Signal Intelligence (SIGINT)  Entity Analysis
Biotech / Pharma  Genome Mapping                Sequencing Analysis
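Clickstream sessionization, listed above for the Web vertical, groups a user's events into sessions separated by a gap of inactivity. A minimal single-machine sketch in Python (the 30-minute timeout is a common convention, not something the slides specify, and the sample events are invented):

```python
SESSION_GAP = 30 * 60  # seconds of inactivity that ends a session

def sessionize(events):
    """events: (user, unix_timestamp) pairs, pre-sorted by user then time.
    Returns a list of sessions, each a list of timestamps."""
    sessions = []
    last_user, last_ts = None, None
    for user, ts in events:
        # A new user or a long enough silence starts a new session.
        if user != last_user or ts - last_ts > SESSION_GAP:
            sessions.append([])
        sessions[-1].append(ts)
        last_user, last_ts = user, ts
    return sessions

clicks = [("a", 0), ("a", 600), ("a", 4000), ("b", 100)]
sessions = sessionize(clicks)
```

At web scale the same logic runs as a MapReduce job: the shuffle sorts each user's events together, and a reducer applies the gap rule per user.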
© 2012 YP Holdings LLC Intellectual Property. All rights reserved. YP Holdings LLC, the YP Holdings LLC logo and all other YP Holdings LLC marks contained herein are trademarks of YP Holdings LLC Intellectual Property and/or YP Holdings LLC affiliated companies. All other marks contained herein are the property of their respective owners. (INTERNAL USE ONLY)
Hadoop@YP
Sept 26, 2012
William Theisinger, Executive Director, Platform Computing
Challenges
What we were facing

• Increasing volume of traffic data through our distribution network
• Need for a system to support changing data complexity and detail
• Adhere to tighter SLAs
• Provide intra-day reporting
• Benefit from the intelligence trapped in our data
Legacy processing flow
[Flow diagram: Application Log Data → ETL processing → three parallel Data Loads → Data Warehouse / Data Layer]

• Dropped reportable events on the floor
• Loaded multiple databases
• Processing time was significant
• Reporting lag was in days, not hours
• High maintenance burden
Hadoop Platform
Hadoop processing flow
[Flow diagram: Applications → LWES → Data Collection → Hadoop Platform → Data Layer → Data Warehouse]

• All ETL processing in Hadoop
• Several systems integrate with the Hadoop platform
• All Java MapReduce, with some Hive for end users and dependent systems
• Reporting lag in hours, not days
• Real reduction in maintenance needs
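The Hive layer in this flow lets end users and dependent systems query Hadoop data with SQL instead of writing Java MapReduce. As a stand-in, this sketch uses Python's sqlite3 to show the kind of aggregate such a query expresses; the table and column names are hypothetical, not YP's actual schema:

```python
import sqlite3

# In-memory database standing in for a Hive table over raw logs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_logs (app TEXT, status TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO app_logs VALUES (?, ?, ?)",
    [("search", "ok", 120), ("search", "err", 40), ("maps", "ok", 300)],
)

# The sort of per-application rollup a Hive query would express;
# on Hadoop, Hive compiles this into MapReduce jobs.
rows = conn.execute(
    "SELECT app, COUNT(*), SUM(bytes) FROM app_logs "
    "GROUP BY app ORDER BY app"
).fetchall()
```

The point is the interface, not the engine: analysts write the GROUP BY, and the platform decides how to distribute the work.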
Next Generation
Hadoop processing flow

[Flow diagram: Applications → LWES → Data Collection → Hadoop Platform and HBase Platform → Data Layer → Data Warehouse]

• Migrating some reporting to HBase
• Exposing core business KPIs via APIs
• Replacing various data marts with HBase tables/schemas
• Reducing TCO
• Aligning core skill sets
Hadoop @ Research In Motion
Aaron Wiebe, BlackBerry Infrastructure Architect
Internal Use Only
The Problem
1. BlackBerry Services currently generate 500TB of instrumentation data daily (and growing rapidly).
2. Traditional systems unable to cope with both growth and access requests.
3. Total global dataset of ~100PB.
Confidential and Proprietary
The Old Way
• Focus on reducing data to the required data set
• Pipeline data flows to avoid hitting disk
• Scalability issues at most stages
• Going back to the archive was very time-consuming
[Flow diagram: Services → Filter and Split → Streaming ETL stages, feeding Event Monitoring, the Data Warehouse, Complex Correlation, Alerting and Archive Storage]
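The "pipeline data flows to avoid hitting disk" step can be sketched with Python generators: each stage consumes the previous one record-at-a-time, so nothing is materialized to intermediate files. The stage names and record format are illustrative, not RIM's actual pipeline:

```python
def filter_and_split(lines):
    # First stage: drop malformed records, split the rest into fields.
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) == 2:
            yield parts

def transform(records):
    # Streaming ETL stage: normalize fields as they flow through.
    for name, value in records:
        yield (name.lower(), int(value))

raw = ["Alpha,10", "bad-record", "Beta,20"]
# Records stream end to end; no intermediate file is written between stages.
loaded = list(transform(filter_and_split(raw)))
```

The trade-off the slide hints at is that such hand-built pipelines avoid disk I/O but concentrate scalability and recovery problems in each stage, which is what the move to Hadoop addressed.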
The Hadoop Way
• Archive storage moved to HDFS
• ETL processes converted to Hadoop (Pig + Hive)
• Some data warehouse functions migrating to Hadoop

[Flow diagram: Services → Filter and Split → Event Monitoring → Alerting; Hadoop now provides Archive Storage plus ETL and Correlation, feeding a Stage 1 DWH and the Data Warehouse]
Real Results
• 90% code base reduction for ETL tools
• Example performance: a previous ad-hoc query took around 4 days; it now takes 53 minutes
• Significant capital cost reductions over the previous system