BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Page 1

BIG DATA PROCESSING

A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Page 2

TOPICS COVERED

1. Fundamentals of Big Data Platforms
2. Major Big Data Tools

Page 3

SCALING: UP VS. OUT

SCALE UP (SMP)

Multiprocessor system where processors share resources:
• Operating System (OS)
• Memory
• I/O devices connected using a common bus

Upgrade components or buy a bigger server each time

SCALE OUT (MPP)

• Multiple processing nodes
• OS
• RAM
• Network

Add nodes to the cluster: +(n)

Page 4

INNOVATION TIMELINE

2002 2003 2004 2005 2006 2007 2008 2009

• Doug Cutting & Mike Cafarella started working on Nutch
• Google publishes GFS & MapReduce papers
• Doug Cutting adds DFS & MapReduce support to Nutch
• Yahoo! hires Cutting, Hadoop spins out of Nutch
• NY Times converts 4TB of image archives over 100 EC2 instances
• Fastest sort of a TB, 3.5 mins over 910 nodes
• Facebook launches Hive: SQL support for Hadoop
• Cloudera founded
• Doug Cutting joins Cloudera
• Hadoop Summit 2009, 750 attendees
• Fastest sort of a TB, 62 secs over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes

Presenter
Presentation Notes
Led by Chad Cook, second presenters will likely be from local offices.
Page 5

THE FUNDAMENTALS OF HADOOP

• Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s
• Hadoop consists of:
  • MapReduce
  • Hadoop Distributed File System (HDFS)

Page 6

WHAT’S NEW…

Page 7

BASICS OF MPP

One worker: 400 bills at 1 bill/sec = 400 seconds

Page 8

BASICS OF MPP

Two workers in parallel:
• 200 bills at 1 bill/sec = 200 seconds
• 200 bills at 1 bill/sec = 200 seconds

Total = 200 seconds

Page 9

BASICS OF MPP

Four workers in parallel:
• 100 bills at 1 bill/sec = 100 seconds
• 100 bills at 1 bill/sec = 100 seconds
• 100 bills at 1 bill/sec = 100 seconds
• 100 bills at 1 bill/sec = 100 seconds

Total = 100 seconds
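
The bill-counting slides come down to one piece of arithmetic: if the work is split evenly and coordination costs nothing, the elapsed time is the largest share divided by the per-worker rate. A minimal Python sketch of that idealized model follows. The function name `elapsed_seconds` and the zero-overhead assumption are illustrative and not taken from the slides; real clusters pay for data movement and coordination.

```python
import math


def elapsed_seconds(total_bills: int, workers: int, bills_per_sec: float = 1.0) -> float:
    """Idealized wall-clock time for `workers` counting `total_bills` in parallel.

    Assumes an even split and no coordination overhead (a hypothetical model).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # The job finishes when the worker holding the largest share finishes.
    largest_share = math.ceil(total_bills / workers)
    return largest_share / bills_per_sec


# Same workload as the slides: 400 bills at 1 bill/sec.
for n in (1, 2, 4):
    print(f"{n} worker(s): {elapsed_seconds(400, n):.0f} seconds")
```

Doubling the workers halves the time only under these assumptions; the sketch leaves out the cost of splitting the pile and combining the counts.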

Page 10

HDFS & MAPREDUCE

The main node runs the Job Tracker and the Name Node; the Name Node controls the files.
Each worker node runs two processes: Task Tracker and Data Node.

Layer          Main node      Nodes 1 to N
MapReduce      Job Tracker    Task Tracker, Task Tracker
HDFS Cluster   Name Node      Data Node, Data Node

Page 11

BASICS OF MAPREDUCE

The main node runs the Job Tracker and the Name Node; the Name Node controls the files.
Each worker node runs two processes: Task Tracker and Data Node.

Diagram: a Query goes to the Name Node/Job Tracker, which passes the Query to the Data Nodes/Task Trackers; a Result comes back.
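
The query flow on this slide can be sketched as a map step run per chunk of data, a shuffle that groups values by key, and a reduce step per key. The Python below is a single-process illustration under stated assumptions: word counting is a stand-in workload, each string in `chunks` plays the part of a block held by one data node, and none of the function names come from Hadoop's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(chunks: List[str]) -> Dict[str, int]:
    """Run map on every chunk, shuffle the pairs, then reduce each key."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Each string stands in for a block stored on a different data node.
chunks = ["big data big clusters", "data nodes store data"]
print(run_job(chunks))
```

In a real cluster the map calls run on the nodes that hold each block and the shuffle moves data over the network; here everything runs in one process to keep the three phases visible.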

Page 12

MAPREDUCE EXECUTION UNITS

Page 13

SOME DISTRIBUTIONS OF APACHE HADOOP

Apache Foundation

Page 14

Hortonworks Sandbox

Page 15

MAPREDUCE, PIG & HIVE

MAPREDUCE

• Java
• Write many lines of code

PIG

• Mostly used by Yahoo
• Mostly used for data processing
• Shares some constructs w/ SQL
• Is more verbose
• Needs a lot of training for users with limited procedural programming background
• Offers control over the flow of data

HIVE

• Mostly used by Facebook for analytic purposes
• Used for analytics
• Relatively easier for developers w/ SQL experience
• Less control over optimization of data flows compared to Pig

PIG & HIVE COMPARED WITH MAPREDUCE

• Not as efficient as MapReduce
• Higher productivity for data scientists and developers
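
The contrast between a declarative SQL style (Hive) and an explicit, step-by-step data flow (Pig) can be illustrated without either tool. In the hedged Python sketch below, the standard-library `sqlite3` module stands in for a SQL engine and plain Python stands in for a data-flow script; neither part is HiveQL or Pig Latin, and the `sales` table and its rows are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Made-up input rows: (region, amount).
rows = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]

# Declarative style: state the result you want and let the engine plan it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

# Data-flow style: spell out each stage, which gives control over the flow.
grouped = defaultdict(list)
for region, amount in rows:  # stage 1: group amounts by region
    grouped[region].append(amount)
flow_totals = {region: sum(amounts) for region, amounts in grouped.items()}  # stage 2: aggregate

# Both styles produce the same totals.
assert sql_totals == flow_totals
print(sql_totals)
```

The SQL version is shorter and familiar to anyone with database experience, while the data-flow version exposes every intermediate step, which is the trade-off the slide describes.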

Page 16

THE EXPLOSION OF HADOOP

Page 17

THE HISTORY OF SPARK

Timeline years: 2004, 2006, 2009, 2010, 2011, 2013, 2014
Timeline milestones: MapReduce, Spark Paper, BSD Open Source, Top Level, Apache

Page 18

SPARK SHARED LIBRARIES

Page 19

SPARK: THE UNIFIED PLATFORM FOR BIG DATA

APIs for:
• Scala
• Java
• Python
• R

Libraries:
• Spark SQL
• Spark Streaming
• MLlib (machine learning)
• GraphX (graph)

Spark Core

Page 20

SPARK BENEFITS

Performance

Using in-memory computing, Spark is considerably faster than Hadoop MapReduce (100x in some tests).

Can be used for batch and real-time data processing.

Unified Engine

Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing.

A single application can combine all types of processing.

Developer Productivity

Easy-to-use APIs for processing large datasets.

Includes 100+ operators for transforming data.

Ecosystem

Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB.

Can run on top of the Apache Hadoop YARN resource manager.
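
The in-memory performance point on this slide matters most for jobs that pass over the same data many times. The toy Python model below is only a sketch of that idea: `DiskDataset` is a hypothetical class that counts full reads from storage, and no Spark API is used. It makes no attempt to reproduce the 100x figure; it only shows where the savings come from.

```python
from typing import List


class DiskDataset:
    """Pretend on-disk dataset that counts how often it is read in full."""

    def __init__(self, records: List[int]) -> None:
        self._records = records
        self.full_reads = 0

    def read(self) -> List[int]:
        self.full_reads += 1
        return list(self._records)


def iterate_rereading(dataset: DiskDataset, passes: int) -> int:
    """Disk-based style: every pass goes back to storage."""
    total = 0
    for _ in range(passes):
        total += sum(dataset.read())
    return total


def iterate_cached(dataset: DiskDataset, passes: int) -> int:
    """In-memory style: read once, then reuse the cached copy."""
    cached = dataset.read()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


rereading = DiskDataset([1, 2, 3])
caching = DiskDataset([1, 2, 3])
print(iterate_rereading(rereading, 10), "full reads:", rereading.full_reads)
print(iterate_cached(caching, 10), "full reads:", caching.full_reads)
```

Both functions return the same total; only the number of trips to storage differs, which is why iterative workloads such as machine learning tend to gain the most from caching.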

Page 21

CORTANA ANALYTICS

Page 22

SQL Server

Big Data Optimizations

Page 23

APS

SQL Server

Presenter
Presentation Notes
KEY POINT: Convey the major pillars of the Analytics Platform System, with key points for each.

To help organizations make a simple, smooth transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS): the only no-compromise modern data warehouse solution that brings both Hadoop and RDBMS together in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry.

Enterprise-ready Big Data
Microsoft APS combines Microsoft’s industry-leading RDBMS platform, the Parallel Data Warehouse appliance (PDW), with Microsoft’s Hadoop distribution, HDInsight, for non-relational data to offer an all-in-one Big Data analytics appliance. Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS.
• Your modern data warehouse in one turnkey appliance: APS integrates PDW and HDInsight to operate seamlessly together in a single appliance.
• Integrated querying across all data types using T-SQL: PolyBase allows Hadoop data to be queried using rich-featured T-SQL, while taking advantage of Hadoop processing, without additional Hadoop-based skills or training.
• Enterprise-ready Hadoop: HDInsight is Microsoft’s Hadoop-based distribution, with end-user authentication via Active Directory and management by IT using System Center.
• Big Data insights to any user: Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel.

Next-generation performance at scale
APS was built to scale into multi-petabytes, handling both RDBMS data and the data stored in Hadoop, to deliver the performance that meets today’s near-real-time and rapid-insight requirements.
• Scale out to accommodate your growing data: APS contains PDW and HDInsight, which both have a linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out.
• Remove DW bottlenecks with MPP SQL Server: Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server.
• Real-time performance with in-memory: Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore.
• Concurrency that supports high adoption: Scales in simultaneous user accessibility. APS has high concurrency, allowing for multiple workloads.

Optimal architecture
More than just a converged system, PDW has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
• APS provides the industry’s lowest DW price/TB.
• Lower cost while maintaining performance, using economical WS2012 Storage Spaces in place of a SAN.
• Save up to 70% of PDW storage with up to 15x compression via updateable in-memory columnstore.
• Value through a single-appliance solution: reduce hardware footprint by having PDW and HDInsight within a single appliance, and remove the need for costly integration efforts.
• Value through flexible hardware options: avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta.
Page 24

APS Growth Topology

Base Unit, Scale Unit, Extension Base Unit

SQL Server

Presenter
Presentation Notes
HP example – Dell is slightly different
Page 25

Azure SQL DW

SQL Server

Page 26

Deployment options and hybrid solutions

SQL Server

Page 27

Connecting Islands of Data with PolyBase

• Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
• Uses the power of MPP to enhance query execution performance
• Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
• Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera

Diagram labels: Select; Result set; SQL Server Parallel Data Warehouse; PolyBase; Microsoft Azure HDInsight; Microsoft HDInsight; Hortonworks for Windows and Linux; Cloudera

SQL Server

Presenter
Presentation Notes
KEY POINT: PolyBase is available only within the Microsoft Analytics Platform System.

TALK TRACK: PolyBase simplifies this by allowing Hadoop data to be queried with the standard Transact-SQL (T-SQL) query language, without the need to learn MapReduce and without the need to move the data into the data warehouse. PolyBase unifies relational and non-relational data at the query level.
• Integrated query: PolyBase accepts a standard T-SQL query that joins tables containing a relational source with tables in a Hadoop cluster referencing a non-relational source. It then seamlessly returns the results to the user. PolyBase can query Hadoop data in other Hadoop distributions such as Hortonworks or Cloudera.
• No difficult learning curve: Standard T-SQL can be used to query Hadoop data. Users are not required to learn MapReduce to execute the query.
• Cloud-hybrid scenario options: PolyBase can also query across Windows Azure HDInsight, providing a hybrid cloud solution to the data warehouse.
Page 28

USE CASE: SUPPLY CHAIN MANAGEMENT

Page 29

USE CASE: SMART GRID MANAGEMENT

Page 30

USE CASES: SMART GRID

• Predictive maintenance
• Demand forecasting
• Grid optimization
• Theft prevention
• Demand response
• Customer profiling

Page 31

Page 32

USE CASES: SMART GRID

Page 33

USE CASE: REAL TIME TRAFFIC ANALYSIS

Page 34

REAL TIME TRAFFIC ANALYSIS

Page 35

USE CASES: STREAM ANALYTICS

• Real-time fraud detection
• Connected cars
• Click-stream analysis
• Real-time financial portfolio alerts
• Smart grid, energy management
• CRM alerting sales to customer case
• Data and identity protection services
• Real-time sales tracking

Page 36

ML PROBLEMS SOLVED BY AZURE ML

• Classification
• Regression
• Recommenders
• Anomaly detection
• Clustering

Page 37

INDUSTRY USE CASES: MACHINE LEARNING

Page 38

THE MACHINE LEARNING WORKFLOW

1. Input data
2. Data transformation
3. Split data
4. Define model
5. Train model
6. Score (prediction)
7. Evaluate model
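
The workflow stages on this slide can be walked through end to end in a few lines of standard-library Python. The sketch below is illustrative only: the data is synthetic, the model is a one-variable least-squares line, and nothing here uses Azure ML. The helper `train_linear` and the 80/20 split are assumptions chosen for the example.

```python
import random
from statistics import mean

# Input data: synthetic points near y = 2x + 1 (made-up data, fixed seed).
rng = random.Random(0)
data = [(float(x), 2.0 * x + 1.0 + rng.uniform(-0.1, 0.1)) for x in range(50)]

# Data transformation: rescale x into the range [0, 1].
max_x = max(x for x, _ in data)
data = [(x / max_x, y) for x, y in data]

# Split data: 80% for training, 20% held out for testing.
rng.shuffle(data)
cut = int(len(data) * 0.8)
train, test = data[:cut], data[cut:]


# Define model: y = slope * x + intercept.
def train_linear(points):
    """Train model: closed-form least squares for a single feature."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = train_linear(train)

# Score (prediction): apply the trained model to the held-out rows.
predictions = [slope * x + intercept for x, _ in test]

# Evaluate model: mean absolute error on the held-out rows.
mae = mean(abs(p - y) for p, (_, y) in zip(predictions, test))
print(f"slope={slope:.2f} intercept={intercept:.2f} MAE={mae:.3f}")
```

Holding out the test rows before training is the point of the split step: the evaluation then measures how the model behaves on data it has not seen.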

Page 39

AZURE DATA FACTORY

Page 40

HDINSIGHT

Page 41

Orion [email protected]

Twitter: @oriongm

BIG DATA & Advanced Analytics Roadshow

Questions?