A talk on the frameworks for building big-data applications.
© 2009 VMware Inc. All rights reserved
Building Big Data Applications Services for Private Clouds
Richard McDougall
Chief Architect, Storage and Application Services
VMware, Inc.
@richardmcdougll
Infrastructure, Apps and now Data…

(Private and public clouds; build, run, manage)
• Simplify infrastructure with cloud
• Simplify the app platform through PaaS
• Simplify data
Trend 1/3: New Data Growing at 60% Y/Y

Source: The Information Explosion, 2009

(Chart categories: medical imaging, sensors; CAD/CAM, appliances, machine data, digital movies; digital photos; digital TV; audio; camera phones, RFID; satellite images, logs, scanners, Twitter)

Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030. Yes, you are part of the yotta generation…
Data Growth in the Enterprise

Trend 2/3: Big Data – Driven by Real-World Benefit

Enterprise: Early Adopter Industries and Use Cases
Early Adopters: Enterprise Segmentation

Verticals: Financial Services; Retail; Telco; Manufacturing; Government
Targets: Existing Hadoop users; business analysts; data scientists; LOB managers; IT/Ops
Use Cases: Business trend analytics; revenue analytics; CDR and call-pattern analytics; sensor data analytics; log and machine data analytics; fraud detection; homeland security; predictive analytics
Early Adopters: Non-enterprise Segmentation

Verticals: Online Advertising; eCommerce; Mobile; Social Media; Gaming
Targets: End users/exec users; business analysts; PM, LOB managers; marketing/sales; data engineers; data scientists; IT/operations
Use Cases: Behavioral analytics; audience segmentation; revenue optimization; user activity monetization; inventory and price management; recommendations; predictive analytics
Why now? More transactions (Social/Mobile/Local, "SoMoLo")

(Examples, spanning SoMoLo startups to big "traditional" companies, by size of data/communications/transactions:)
• 3.7B calls/month
• 30B messages/month
• 10k card transactions/sec
• 1 TB data/day
• 500 TB data/day
• 35 check-ins/sec
• 13k API calls/sec
Trend 3/3: Value from Data Exceeds Hardware Cost

! Value from the intelligence of data analytics now outstrips the cost of hardware
• Hadoop enables the use of 10x lower-cost hardware
• Hardware cost halves every 18 months

(Chart: value vs. cost; big iron at $40k/CPU vs. a commodity cluster at $1k/CPU)
The Old Big Data Stack

(Diagram components:)
• ETL: Extract, Transform, Load (Informatica)
• Column-oriented relational database (Oracle, Teradata, DB2)
• Data visualization (Crystal Reports, Business Objects)
• Business intelligence
• Statistics (SAS, SPSS)
• Master data management (Oracle, SAP)
• Sources: files, SQL databases
! Unable to handle large data volumes & diversity of data
! Iterative, brute-force and slow process
! Lack of ad-hoc data navigation across events and time
! Cumbersome ETL to “process” and DBAs to “prepare”
! Focused on structured data that is warehoused
! Web analytics solutions force real-time events into rigid schemas in DBs
The Journey To Big Data Analytics: All Data, Faster Answers, Elastic & Scalable

1. BI as a Service (technology focus): analytics engines on cloud infrastructure. Goal: encourage experimentation with existing data.
2. Agile Analytics (people & productivity focus): an analytic productivity platform with agile process & tools; data science, collaboration, self-service. Goal: discover meaningful insights that impact the business.
3. Predictive Enterprise (application focus): big-data-enabled apps; real-time decisions, new applications, data monetization. Goal: operationalize those insights as quickly as possible.
Customer profiles

1. Business analysts, LOB managers, execs. Need: out-of-the-box analytics. Designed for: self-service for end users, leveraging app developers.
2. Data engineers/analysts. Need: out-of-the-box + some customization. Designed for: admin + operations.
3. Data scientists. Need: power capabilities + heavy customization. Designed for: data scientists.
4. IT, Operations. Need: out-of-the-box + some customization. Designed for: IT/admin, ops.
What is Data Science and Data Engineering?

(Diagram: data science & data engineering combine:)
• Distributed/parallelization algorithm and programming skills
• Math and statistical knowledge
• Business domain and problem understanding
• Vertical or horizontal use-case and analytics experience
What is Driving Big Data?

(Chart: structured, semi-structured, and largely unstructured data sources)

Source: IBM and Oxford Survey, "Getting Closer to Customers Tops Big Data Agenda," October 17, 2012
Today's Big Data System

(Diagram components:)
• ETL and real-time streams feeding in
• Unstructured data (HDFS) with data-parallel batch processing
• Real-time structured database with real-time processing (S4, Storm)
• Big SQL
• Analytics on top
The Unified Analytics Cloud Platform

(Layered architecture, bottom to top:)
• Cloud infrastructure (private/public): vSphere
• Data platform (database/datastore): Cassandra, HAWQ, HBase, Impala, HDFS
• Data PaaS: Data Director, EMC Chorus
• Developer frameworks (PaaS): Hadoop, R, Python, MADlib, Spring, Cloud Foundry
• Analytics tools: Datameer, Karmasphere, Tableau
The New Big Data System

(Diagram components:)
• ETL and real-time streams
• Structured and unstructured data (HDFS, S3)
• Real-time structured database; structured-data engine
• Unstructured and batch processing (Hadoop, Hive)
• Real-time stream processing
• Federated query (SQL aggregation); common query
• Automated models; business intelligence
• Data visualization (Excel, Tableau)
• Cloud infrastructure: compute, storage, networking
An Example – Automated Performance Management

(Pipeline: 10M performance stats/min -> stats database -> batch baseline calculation -> trigger models; runs on cloud infrastructure: compute, storage, networking)
Big (Data) problems: becoming the standardized stack

(Columns: Google; Facebook; Yahoo; LinkedIn; Cloudera; Twitter)
• Metadata: Dremel; Hive; Hive; Hive
• Schedule & pipeline workloads: Evenflow; Databee; Oozie; Azkaban; Oozie
• Dataflow/queries: –/Sawzall; –/Hive; Pig/Hive; Pig; Pig/Hive; Cascading
• More-structured data store: Bigtable; HBase; HBase; Voldemort; HBase; Cassandra
• DB data collection/integration: MySQL gateway; Sqoop; Sqoop
• Event data collection: Scribe; Data Highway; Kafka?; Flume; Scribe
• Streaming data processing: – (no standard yet, across all six)
• Batch data processing: MapReduce; Hadoop; Hadoop; Hadoop; Hadoop; Hadoop
• File storage: GFS; Hadoop; Hadoop; Hadoop; Hadoop; Hadoop
• Coordination: Chubby; Zookeeper; Zookeeper; Zookeeper; Zookeeper; Zookeeper
New Technologies

(The same system diagram, annotated with technologies:)
• Sources: Twitter, sensor data, mobile events, machine logs
• ETL and real-time streams
• File storage: HDFS, Ceph, MapR, Colossus
• Real-time structured database: Spark, Shark, GemFire, HBase?
• Structured-data engines: Aster, Greenplum, etc.
• Unstructured and batch processing (Hadoop, Hive): MapReduce
• Real-time stream processing: S4, Storm
• Query virtualization (SQL aggregation); common query
• Automated models: machine learning, CETAS
• Business intelligence; data visualization (Excel, Tableau)
• Cloud infrastructure: compute, storage, networking
Agenda

! Frameworks
• Batch processing: Hadoop, Spark
• Graph processing: Pregel, Apache Giraph
• Real-time processing: Storm, S4, D-Streams
• Interactive processing: Hive, Impala, Shark

! New requirements
• Better network architectures, abstractions and end-to-end resource management
• Whither disk-locality, and the flexibility to move data to compute instead
• Cluster/datacenter-wide storage abstractions and services
• The silo-less datacenter (multiple frameworks sharing a single physical cluster and sharing "sticky" data)
Big Data Processing Patterns (batch, real-time or interactive)

• Reverse funnel (small input, large output), e.g., logfile loading
• Funnel (large input, small output), e.g., link/ad click statistics
• Data transform (input and output sizes similar), e.g., data conversion/translation
• Iterative, e.g., machine-learning tasks
• Graph-based analyses to reason about relationships, e.g., PageRank, Ravi's social approach to VI management

Typical frameworks: Hadoop, Hive, Impala (batch/interactive); Storm, S4, D-Streams, Shark (real-time/interactive); Spark (iterative); Pregel, Giraph (graph)
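For concreteness, the funnel pattern (large input in, small summary out) can be sketched in a few lines of plain Python. This is a toy, single-machine illustration with invented event fields, not production code:

```python
from collections import Counter

def click_stats(events):
    # Funnel pattern: a large stream of raw events in, a small summary out
    # (here, per-ad click counts distilled from a click log)
    return Counter(e["ad_id"] for e in events if e["type"] == "click")

events = [
    {"type": "click", "ad_id": "a1"},
    {"type": "view",  "ad_id": "a1"},
    {"type": "click", "ad_id": "a2"},
    {"type": "click", "ad_id": "a1"},
]
stats = click_stats(events)   # Counter({'a1': 2, 'a2': 1})
```

At cluster scale the same shape shows up as a MapReduce job whose reducers emit far less data than the mappers read.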
Batch processing frameworks (1/2)

! Apache Hadoop MapReduce (Yahoo!)
• Parallel data-processing paradigm (made popular by Google). Uses a distributed file system (HDFS) for persistence and commodity hardware.
• Model of operation: Mapper (read from HDFS + compute in parallel) -> Reducer (process map outputs in parallel) -> write to HDFS
• Key components: NameNode, DataNode, TaskTracker, JobTracker
• Apache ZooKeeper sometimes used for coordination
• Weakness: not well suited for iterative (or graph) computations
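The Mapper -> Reducer flow above can be sketched as a toy, single-process simulation in plain Python. This mimics the map, shuffle, and reduce phases in memory; it is not the Hadoop API, and HDFS reads/writes are replaced by in-memory lists:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: read records from an input split, emit (word, 1) pairs
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key before reducing
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the grouped values (here, sum the counts)
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big apps", "data platform"]
counts = reduce_phase(shuffle(map_phase(lines)))
# {'big': 2, 'data': 2, 'apps': 1, 'platform': 1}
```

In real Hadoop, each phase runs in parallel across many machines and the shuffle moves data over the network; the weakness noted above is that iterative algorithms must re-run this whole pipeline (including HDFS writes) on every iteration.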
Batch processing frameworks (2/2)

! Spark (UC Berkeley)
• Support for iterative computations and interactive data mining by caching data in cluster RAM. Uses commodity machines.
• Core abstraction: Resilient Distributed Datasets (RDDs), used as variables in Spark programs. RDDs include lineage data for easy recovery/reconstruction.
• Up to ~20x speedup over Hadoop. Used by Quantifind, Conviva, …

Image courtesy Zaharia et al.: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
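A minimal sketch of the RDD idea, transformations recorded as lineage plus optional caching, in plain Python. This is a toy model of the abstraction, not the Spark API:

```python
class RDD:
    """Toy Resilient Distributed Dataset: each RDD records its lineage
    (parent + transformation), so a lost partition can be recomputed
    from its parent instead of being restored from a replica."""
    def __init__(self, compute, parent=None):
        self._compute = compute   # function that produces this RDD's data
        self.parent = parent      # lineage pointer
        self._cache = None

    def map(self, f):
        return RDD(lambda: [f(x) for x in self.collect()], parent=self)

    def filter(self, pred):
        return RDD(lambda: [x for x in self.collect() if pred(x)], parent=self)

    def cache(self):
        self._cache = self._compute()   # pin results in (cluster) RAM
        return self

    def collect(self):
        return self._cache if self._cache is not None else self._compute()

base = RDD(lambda: [1, 2, 3, 4])
squares = base.map(lambda x: x * x).cache()    # reused across iterations
evens = squares.filter(lambda x: x % 2 == 0)   # recomputes only from the cache
```

The cache is what makes iterative jobs fast; the lineage pointer is what makes losing the cache survivable without replication.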
Graph processing frameworks

! Pregel (Google) / Apache Giraph
• Multiple instances of vertex programs: user-defined functions running at each vertex
• Bulk Synchronous Parallel (BSP) processing, e.g., used for PageRank
• Stateful in-memory computations. Fault-tolerance via checkpoints
• Runs on commodity hardware (racks with high intra-rack bandwidth)

(Figure: supersteps alternate compute, communicate, and barrier phases across VMs)
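The BSP vertex-program model can be sketched as a toy, single-process PageRank loop. The superstep structure (compute, communicate, barrier) is the point here; the graph and damping factor are made up for illustration:

```python
def pregel_pagerank(graph, iterations=10, d=0.85):
    """Toy sketch of the BSP vertex-program model: in each superstep,
    every vertex reads its inbox, updates its state, and sends messages;
    a barrier separates supersteps."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Compute + communicate: each vertex sends rank/outdegree to neighbors
        inbox = {v: [] for v in graph}
        for v, neighbors in graph.items():
            for u in neighbors:
                inbox[u].append(rank[v] / len(neighbors))
        # Barrier: all messages are delivered before the next superstep starts
        rank = {v: (1 - d) / n + d * sum(inbox[v]) for v in graph}
    return rank

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pregel_pagerank(graph)
```

In Pregel/Giraph the vertices are partitioned across workers and the "communicate" step crosses the network, which is why high intra-rack bandwidth matters.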
Real-time processing frameworks (stream-processing) 1/2

! S4 (Yahoo!), Storm (Twitter)
• Record-at-a-time processing. Checkpointing for fault-tolerance (S4)

Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
Real-time processing frameworks (stream-processing) 2/2

! Discretized Streams / D-Streams (UC Berkeley)
• Treat a streaming computation as a series of batch computations on small time intervals. D-Stream = chain of RDDs.
• Fault-tolerance without replication or upstream backup (buffering)

Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
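The D-Stream idea, a stream processed as a series of small batch computations, can be sketched in plain Python. This is a toy simulation; the batch interval and event list are illustrative:

```python
def dstream_counts(stream, batch_interval):
    # A D-Stream treats the live stream as a chain of small batches:
    # each interval's events are processed as one deterministic batch job
    results = []
    for start in range(0, len(stream), batch_interval):
        batch = stream[start:start + batch_interval]  # one interval of events
        results.append(len(batch))                    # the per-batch computation
    return results

events = ["e1", "e2", "e3", "e4", "e5", "e6", "e7"]
counts = dstream_counts(events, batch_interval=3)   # [3, 3, 1]
```

Because each micro-batch is a deterministic computation over immutable input (an RDD in the real system), a lost result can be recomputed from lineage rather than recovered from a replica or an upstream buffer.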
Interactive processing frameworks 1/4

! Apache Hive (Facebook)
• Open-source data warehouse built on top of Hadoop. HiveQL queries are compiled into MapReduce jobs. Expensive WHERE clauses = table scans = high latency.

Image courtesy Cubrid: http://www.cubrid.org/blog/dev-platform/platforms-for-big-data/
Interactive processing frameworks 2/4

! Pivotal HAWQ
Interactive processing frameworks 3/4

! Impala (Cloudera)
• Inspired by Dremel (Google). Key concepts: columnar data storage (Trevni), aggregation trees for distributed query evaluation.
• Takes advantage of Hive tables. Uses memory as a cache for tables.
• Does not use MapReduce to answer queries (unlike Hive).
• 3x-90x faster than Hive

Image courtesy Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
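Why columnar storage helps the aggregate queries these engines target can be shown with a toy sketch. The field names are invented, and a real columnar format like Trevni adds encoding, compression, and I/O skipping on top of this basic idea:

```python
# Row-oriented layout: one dict per record (what a row store scans)
rows = [
    {"ad_id": "a1", "clicks": 3, "region": "us"},
    {"ad_id": "a2", "clicks": 5, "region": "eu"},
    {"ad_id": "a3", "clicks": 2, "region": "us"},
]

# Column-oriented layout: one array per column
columns = {k: [r[k] for r in rows] for k in rows[0]}

# SELECT SUM(clicks): a columnar engine reads one contiguous column,
# instead of touching every field of every row
total_clicks = sum(columns["clicks"])   # 10
```

The win grows with table width: an aggregate over one column of a 100-column table reads roughly 1% of the bytes a row scan would.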
Interactive processing frameworks 4/4

! Shark (UC Berkeley)
• Key concepts: columnar data storage (in-memory), directed acyclic graphs of tasks for distributed query optimization and evaluation, dynamic mid-query replanning
• Uses Spark RDDs to store data and query-processing results
• SQL interface (HiveQL-compatible)
• 100x faster than Hadoop, 100x faster than Hive

Image courtesy Xin et al.: http://shark.cs.berkeley.edu/presentations/2012-11-26-shark-tech-report.pdf
Unifying the Big Data Platform using Virtualization

! Goals
• Make it fast and easy to provision new data clusters on demand
• Allow mixing of workloads
• Leverage virtual machines to provide isolation (especially for multi-tenancy)
• Optimize data performance based on virtual topologies
• Make the system reliable based on virtual topologies

! Leveraging virtualization
• Elastic scale
• Use high availability to protect key services, e.g., Hadoop's NameNode/JobTracker
• Resource controls and sharing: re-use underutilized memory and CPU
• Prioritize workloads: limit or guarantee resource usage in a mixed environment
A Unified Analytics Cloud Significantly Simplifies

(Diagram: separate Hadoop, SQL, decision-support and NoSQL clusters consolidated onto a unified analytics infrastructure, private or public)

! Simplify
• Single hardware infrastructure
• Faster/easier provisioning

! Optimize
• Shared resources = higher utilization
• Elastic resources = faster on-demand access
Simplify Heterogeneous Data Management via Data PaaS

(Stack: analytics tools and developer frameworks sit on a Data PaaS, a common data-management layer providing provisioning, management, multi-tenancy, data discovery and import/export across Big SQL, large-scale file-system, NoSQL and in-memory databases, all running on cloud infrastructure)
Technology: Databases and Data Stores for Big Data

(A spectrum from unstructured to structured data:)
• Large-scale file system. Data: log files, machine-generated data, documents, device data, etc. Technologies: NAS, HDFS, blob stores, S3, MapR, etc. Values: store any data, easy to scale out, can optimize for cost.
• NoSQL. Data: loosely typed device data, records, events, statistics, complex relations/graphs. Technologies: Cassandra, HBase, Voldemort. Values: easy to scale out, flexible and dynamic schemas.
• In-memory. Data: structured, partitionable data. Technologies: GemFire, Redis, Membase, Spark. Values: high throughput, low latency.
• Big SQL. Data: structured data. Technologies: HAWQ, Impala, Aster, … Values: high performance for repetitive queries; ease of query language.
The Unified Analytics Cloud Platform

(Layered architecture, bottom to top:)
• Cloud infrastructure (private/public): vSphere
• Data platform (database/datastore): Cassandra, Greenplum, HBase, Voldemort, HDFS
• Data PaaS: Data Director, EMC Chorus
• Developer frameworks (PaaS): Hadoop, R, Python, MADlib, Spring, Cloud Foundry
• Analytics tools: Datameer, Karmasphere, Tableau
Summary

! Revolution in Big Data is under way
• Data-centric applications are now critical

! Hadoop on virtualization
• Proven performance
• Cloud/virtualization values apparent for Hadoop use

! Simplify through a Unified Analytics Cloud
• One platform for today's and future big-data systems
• Better utilization
• Faster deployment, elastic resources
• Secure, isolated, multi-tenant capability for analytics
References

! Twitter: @richardmcdougll
! My CTO blog: http://communities.vmware.com/community/vmtn/cto/cloud
! Hadoop on vSphere
• Talk @ Hadoop World
• Performance paper: http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf
! Spring Hadoop: http://blog.springsource.org/2012/02/29/introducing-spring-hadoop