Future of Data Intensive Applications


"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.

Future of Data Intensive Applications
Milind Bhandarkar, Chief Scientist, Pivotal
@techmilind
Thursday, December 12, 2013

About Me
  • Founding member of the Hadoop team at Yahoo! [2005-2010]
  • Contributor to Apache Hadoop since v0.1
  • Built and led the Grid Solutions Team at Yahoo! [2007-2010]
  • Parallel Programming Paradigms [1989-today] (PhD)
  • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Kryptonite: First Hadoop Cluster at Yahoo!

M45

OpenCirrus

Analytics Workbench

  • 70% of data generated by customers
  • 80% of data being stored
  • 3% being prepared for analysis
  • 0.5% being analyzed

[Figure: YARN Platform (image courtesy Arun Murthy, Hortonworks)]

[Figure: YARN Architecture (image courtesy Arun Murthy, Hortonworks)]

YARN
  • Yet Another Resource Negotiator
  • Resource Manager
  • Node Managers
  • Application Masters: specific to paradigm, e.g. MR Application Master (aka JobTracker)

Beyond MapReduce
  • Apache Giraph - BSP & Graph Processing
  • Storm on YARN - Streaming Computation
  • HOYA - HBase on YARN
  • Hamster - MPI on Hadoop
  • More to come ...
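The division of labor among the three YARN roles named on the YARN slide can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Hadoop YARN API: the class names mirror the slide, but every method, field, node name, and task name here is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeManager:
    """Hypothetical stand-in for a Node Manager: tracks free container slots on one node."""
    name: str
    free_slots: int


@dataclass
class ResourceManager:
    """Hypothetical stand-in for the Resource Manager: the only component that grants containers."""
    nodes: list

    def allocate(self, requested: int) -> list:
        """Grant up to `requested` containers and return the node name for each one."""
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < requested:
                node.free_slots -= 1
                granted.append(node.name)
        return granted


@dataclass
class ApplicationMaster:
    """Paradigm-specific logic (e.g. MapReduce): decides which task runs in which container."""
    job: str
    tasks: list

    def run(self, rm: ResourceManager) -> dict:
        containers = rm.allocate(len(self.tasks))
        # zip stops at the shorter list, so tasks beyond the grant stay pending.
        return dict(zip(self.tasks, containers))


# Assumed toy cluster: three container slots in total, four tasks requested.
rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 1)])
am = ApplicationMaster("wordcount", ["map-0", "map-1", "reduce-0", "reduce-1"])
placement = am.run(rm)
print(placement)  # reduce-1 is absent because only three slots were available
```

The point of the split is that the resource manager knows nothing about the programming paradigm, so a different application master (for graph processing, streaming, or MPI) could reuse the same allocation path.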
Hamster
  • Hadoop and MPI on the same cluster
  • Open MPI runtime on Hadoop YARN
  • Hadoop provides: Resource scheduling, Process monitoring, Distributed File System
  • Open MPI provides: Process launching, Communication, I/O forwarding

Hamster Components
  • Hamster Application Master
  • Gang Scheduler, YARN Application Preemption
  • Resource Isolation (lxc Containers)
  • ORTE: Hamster Runtime
  • Process launching, Wireup, Interconnect

[Figure: Hamster Architecture. Labels shown: Resource Manager (Scheduler, AMService); Node Managers (Aux Srvcs, Proc/Container, Framework Daemon, NS); MPI AM (MPI Scheduler, HNP); Client; links Client-RM, RM-AM, AM-NM, RM-NodeManager]

Hamster Scalability
  • Sufficient for small to medium HPC workloads
  • Job launch time gated by YARN resource scheduler

             Launch     WireUp     Collectives  Monitor
  OpenMPI    O(logN)    O(logN)    O(logN)      O(logN)
  Hamster    O(N)       O(logN)    O(logN)      O(logN)

GraphLab + Hamster on Hadoop

About GraphLab
  • Graph-based, high-performance distributed computation framework
  • Started by Prof. Carlos Guestrin at CMU in 2009
  • Recently founded GraphLab Inc. to commercialize it

GraphLab Features
  • Topic Modeling (e.g. LDA)
  • Graph Analytics (PageRank, Triangle counting)
  • Clustering (K-Means)
  • Collaborative Filtering
  • Linear Solvers
  • etc...
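PageRank, listed above under Graph Analytics, gives a sense of the iterative, per-vertex computation that graph frameworks are built around. The following is a minimal single-machine sketch in plain Python; it does not use the GraphLab API, and the function name, the damping factor of 0.85, the iteration count, and the four-vertex example graph are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, destination) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over vertices.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every vertex receives the "teleport" share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A vertex with no out-links spreads its rank evenly over all vertices.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical example graph: a 3-cycle a -> b -> c -> a, plus d -> c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for vertex, score in sorted(ranks.items()):
    print(vertex, round(score, 4))
```

Each iteration only needs a vertex's own rank and its neighbor lists, which is why this kind of computation partitions naturally across machines in a distributed graph framework.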
Only Graphs are not Enough
  • Full data processing workflow required
  • ETL/Postprocessing, Visualization, Data Wrangling, Serving
  • MapReduce excels at data wrangling
  • OLTP/NoSQL row-based stores excel at serving
  • GraphLab should co-exist with other Hadoop frameworks

Call to Action

Prepare for Convergence
  • HPC: Cache Coherence, Prefetching, Zero-copy, Low-contention locks
  • Big Data: Caching, Mirroring, Sharding (various flavors), relaxed consistency
  • Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization

Convergence
  • Resource Allocation, Scheduling, Lifecycle Management
  • Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs
  • Authentication & Authorization, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering

New Hardware Platforms
  • Mellanox - Hadoop Acceleration through Network-assisted Merge
  • RoCE - Brocade, Cisco, Extreme, Arista...
  • ARM - Low power Hadoop servers
  • SSD - Velobit, Violin, FusionIO, Samsung...
  • Niche - Compression, Encryption...

Data Cloud of Future?
  • Deploy: Public Cloud, Private Cloud, On Premise

Questions?