Upload
milind-bhandarkar
View
895
Download
58
Embed Size (px)
DESCRIPTION
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
Future of Data Intensive Applications
Milind BhandarkarChief Scientist, Pivotal
@techmilind
Thursday, December 12, 2013
About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
Thursday, December 12, 2013
Thursday, December 12, 2013
Kryptonite: First Hadoop Cluster At Yahoo!
Thursday, December 12, 2013
M45Thursday, December 12, 2013
OpenCirrusThursday, December 12, 2013
Analytics Workbench
Thursday, December 12, 2013
Analytics Workbench
Thursday, December 12, 2013
Thursday, December 12, 2013
70% of data generated by
customers
80% of data being stored
3% being prepared for
analysis
0.5% being analyzed
<0.5% being operationalized
Average Enterprises
The Big GapThursday, December 12, 2013
Example: Healthcare
• In last 5 years
• 3,573 studies on hospital readmissions
• 9,745 papers on comparative effectiveness
• 39,230 studies on drug interactions
• 132,241 studies on hospital mortality
• Yet, very few models operational
Thursday, December 12, 2013
PowerPoint is where Models go to Die.
- Hulya Farinas, Principal Data Scientist, Pivotal
Thursday, December 12, 2013
ModernizationThursday, December 12, 2013
Building BlocksThursday, December 12, 2013
Structured
BI/Analytics Tools Data Science Applications
Semi-structured Unstructured
High-Speed Integration Data provisioning, shared security, coordinated transformation
(Big) Data Staging Platform
Analytic Data Warehouse
In Memory Data Grid
On-Demand, Self-Service Access
Meta-data driven access control
Modern Data Architecture
Thursday, December 12, 2013
Data Fabric Requirements
• Store massive & diverse data sets economically
• Integrate and Ingest from legacy & disparate sources
• Ability to rapidly analyze massive data sets
• Control, Auditing, Manageability
• Self-Service
Thursday, December 12, 2013
Data Fabric ArchitectureThursday, December 12, 2013
Infrastructure-As-A-Service is the new
“Hardware”
Thursday, December 12, 2013
IAAS: New Hardware
• AWS, GCE, Azure
• vSphere, OpenStack
• Easy Provisioning
• Scalable, Elastic, Ubiquitous
•Needs bundling with Data & Analytics as Services
Thursday, December 12, 2013
App Fabric Requirements
• IAAS Cloud-Agnostic
• Rapid provisioning, Elasticity
•Open, No-Lock-In, Data As-A-Service
• Automation for Application Lifecycle Management
•Developer Agility : Eliminate Infrastructure Wiring
Thursday, December 12, 2013
Thursday, December 12, 2013
EcosystemThursday, December 12, 2013
Broader EcosystemThursday, December 12, 2013
Legacy App Deployment
Thursday, December 12, 2013
!"#$%&%#'()*+(,-#./0(
((12"341()*+(,-#./0(
((!.&5()*+(2!!0(
((6%'/()*+(&4"$%,4&0(
((&,2-4()*+(2!!0(7899(
.!3"2/4()*+(,-#./0(
(
Modern App Deployment
Thursday, December 12, 2013
Infrastructure One
JVM
VM
Infrastructure One
Infrastructure Two
App
Container 1
App Server
JVM
Container 2
App Server
JVM
Dev Framework Dev Framework
App Server
Configurations Manifests, Automations
Infrastructure Two
JVM
VM
Dev Framework
App Server
Configurations
App App App
Application As Unit of Deployment
Thursday, December 12, 2013
Hadoop’s Role in Data Clouds
Thursday, December 12, 2013
Trough of Disillusionment ?
Thursday, December 12, 2013
Or, Hadoop Everywhere?
Thursday, December 12, 2013
Thursday, December 12, 2013
Thursday, December 12, 2013
Thursday, December 12, 2013
Thursday, December 12, 2013
Thursday, December 12, 2013
Game Changing Hadoop Economics
$-
$20,000
$40,000
$60,000
$80,000
2008 2009 2010 2011 2012 2013
Big Data Platform Price/TB
Big Data DB Hadoop
Thursday, December 12, 2013
Storage Options
• HDFS, MapR, Quantcast QFS
• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre
• Amazon S3, EMC Atmos, OpenStack Swift
• GlusterFS, Ceph
• EMC ViPR
Thursday, December 12, 2013
SQL-on-Hadoop
• Pivotal HAWQ
• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger
• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase
•More to come...
Thursday, December 12, 2013
!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
INTERACTIVE
!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
ONLINE
Hadoop 1.0(Image Courtesy Arun Murthy, Hortonworks)
Thursday, December 12, 2013
MapReduce 1.0
(Image Courtesy Arun Murthy, Hortonworks)
Thursday, December 12, 2013
Hadoop 2.0
(Image Courtesy Arun Murthy, Hortonworks)
HADOOP 1.0
!"#$%!"#$%&$'&()*"#+,'-+#*.(/"'0#1*
&'()*+,-*%!2+%.(#"*"#./%"2#*3'&'0#3#&(*
*4*$'('*5"/2#..,&01*
!"#$.%!"#$%&$'&()*"#+,'-+#*.(/"'0#1*
/0)1%!2+%.(#"*"#./%"2#*3'&'0#3#&(1*
2*3%!#6#2%7/&*#&0,*
HADOOP 2.0
456%!$'('*8/91*
!57*%!.:+1*
%89:*;<%!2'.2'$,&01*
*
456%!$'('*8/91*
!57*%!.:+1*
%89:*;<%!2'.2'$,&01*
%
&)%!-'(2;1*
)2%%$9;*'=>%?;'(:%!"#$%&''()$*+,'
*
$*;75-*<%-.*/0'
*
Thursday, December 12, 2013
!""#$%&'()*+,-)+.&'/0#1+2.+3&4(("+
35678+!"#$%&$'&()*"#+,'-+#*.(/0'1#2*
9!,.+!3+%4(#0*"#4/%05#*6'&'1#7#&(2***
:!;<3+=>&",04-%0?+
2.;@,!<;2A@+=;0B?+
7;,@!>2.C+=7D(EFG+7HGI?+
C,!J3+=C$E&"K?+
2.L>@>M,9+=7"&EN?+
3J<+>J2+=M"0)>J2?+
M.O2.@+=3:&*0?+
M;3@,+=70&E%K?+=P0&/0I?+
YARN Platform
(Image Courtesy Arun Murthy, Hortonworks)
Thursday, December 12, 2013
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
+"',&-'$)*./.*
+"',&-'$)*0/1*
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
+"',&-'$)*./0*
+"',&-'$)*./2*
3%*.*
+"',&-'$)*0/0*
+"',&-'$)*0/.*
+"',&-'$)*0/2*
3%0*
+4-$',0*
5$6"7)8$%&'&($)*
98:$#74$)*
YARN Architecture
(Image Courtesy Arun Murthy, Hortonworks)
Thursday, December 12, 2013
YARN
• Yet Another Resource Negotiator
• Resource Manager
•Node Managers
• Application Masters
• Specific to paradigm, e.g. MR Application master (aka JobTracker)
Thursday, December 12, 2013
Beyond MapReduce
• Apache Giraph - BSP & Graph Processing
• Storm on Yarn - Streaming Computation
• HOYA - HBase on Yarn
• Hamster - MPI on Hadoop
•More to come ...
Thursday, December 12, 2013
Hamster
• Hadoop and MPI on the same cluster
• OpenMPI Runtime on Hadoop YARN
• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System
• Open MPI Provides: Process launching, Communication, I/O forwarding
Thursday, December 12, 2013
Hamster Components
• Hamster Application Master
• Gang Scheduler, YARN Application Preemption
• Resource Isolation (lxc Containers)
•ORTE: Hamster Runtime
• Process launching, Wireup, Interconnect
Thursday, December 12, 2013
Resource Manager
Scheduler
AMService
Node Manager Node Manager Node Manager !
Proc/Container
Framework Daemon NS MPI
Scheduler HNP
MPI AM
Proc/Container
! RM-AM
AM-NM
RM-NodeManager Client Client-RM
Aux Srvcs
Proc/Container
Framework Daemon NS
Proc/Container
!
Aux Srvcs RM-
NodeManager
Hamster ArchitectureThursday, December 12, 2013
Hamster Scalability
• Sufficient for small to medium HPC workloads
• Job launch time gated by YARN resource scheduler
Launch WireUp Collectives Monitor
OpenMPI O(logN) O(logN) O(logN) O(logN)
Hamster O(N) O(logN) O(logN) O(logN)
Thursday, December 12, 2013
GraphLab + Hamsteron Hadoop
!
Thursday, December 12, 2013
About GraphLab
• Graph-based, High-Performance distributed computation framework
• Started by Prof. Carlos Guestrin in CMU in 2009
• Recently founded Graphlab Inc to commercialize Graphlab.org
Thursday, December 12, 2013
GraphLab Features
• Topic Modeling (e.g. LDA)
• Graph Analytics (Pagerank, Triangle counting)
• Clustering (K-Means)
• Collaborative Filtering
• Linear Solvers
• etc...
Thursday, December 12, 2013
Only Graphs are not Enough
• Full Data processing workflow required ETL/Postprocessing, Visualization, Data Wrangling, Serving
• MapReduce excels at data wrangling
• OLTP/NoSQL Row-Based stores excel at Serving
• GraphLab should co-exist with other Hadoop frameworks
Thursday, December 12, 2013
Call To Action
Thursday, December 12, 2013
Prepare for Convergence
• HPC: Cache Coherence, Prefetching, Zero-copy, Low-contention locks
• “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency
•Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization
Thursday, December 12, 2013
Convergence
• Resource Allocation, Scheduling, Lifecycle Management
• Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs
• Auth & Auth, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering
Thursday, December 12, 2013
New Hardware Platforms
•Mellanox - Hadoop Acceleration through Network-assisted Merge
• RoCE - Brocade, Cisco, Extreme, Arista...
• ARM - Low power Hadoop servers
• SSD - Velobit, Violin, FusionIO, Samsung..
•Niche - Compression, Encryption...
Thursday, December 12, 2013
Data Cloud of Future?
depl
oy
Public Cloud
Private Cloud
On Premise
Thursday, December 12, 2013
Questions?Thursday, December 12, 2013