BIG DATA PROCESSING A DEEP DIVE IN HADOOP ......SPARK BENEFITS Performance Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests). Can be used for

BIG DATA PROCESSING

A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

TOPICS

COVEREDFundamentals of Big Data Platforms

Major Big Data Tools

1

2

SCALE UP (SMP)

Multiprocessor system where processors share resources : • Operating System (OS)• Memory• I/O devices connected using a common bus

SCALE OUT (MPP)

• Multiple processing nodes• OS• RAM• Network

+(n)

Add nodes to the clusterUpgrade components or buy bigger server each time

Scaling

Up vs. Out

2002 2003 2004 2005 2006 2007 2008 2009

Doug Cutting & Mike Cafarellastarted working on Nutch

Doug Cutting adds DFS &MapReduce support to Nutch

NY Times converts 4TB of image archives over 100 EC2s

Fastest sort of a TB,62 secs over 1,460 nodesSorted a PB in 16.25 hours

over 3,658 nodes

Fastest sort of a TB, 3.5 minsover 910 nodes

Google publishes GFS & MapReduce papers

Yahoo! Hires Cutting,Hadoop spins out of Nutch

Facebooks launches Hive:SQL Support for Hadoop Hadoop Summit 2009,

750 attendees

Doug Cuttingjoins ClouderaFounded

Innovation

Timeline

Presenter

Presentation Notes

Led by Chad Cook, second presenters will likely be from local offices.

THE

FUNDAMENTALS OF HADOOP

• Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s

• Hadoop consists of:• MapReduce• Hadoop Distributed File System

(HDFS)

WHAT’S

NEW…

BASICS OF

MPP

1 bill/ sec = 400 Seconds400 bills

BASICS OF

MPP

Total = 200 Seconds



BASICS OF

MPP

Total = 100 Seconds

= 100 Seconds100 Bills 1 bill/ sec




HDFS &

MAPREDUCEThe Main Node: runs the Job tracker and the name node controls the files.Each node runs two processes: Task Tracker and Data Node

Job Tracker Task Tracker Task Tracker

Name Node Data Node Data Node

Map Reduce

HDFS Cluster

1 N

BASICS OF

MAPREDUCEThe Main Node: runs the Job tracker and the name node controls the files.Each node runs two processes: Task Tracker and Data Node

Query

Result

Name Node/Job TrackerQuery

Data Nodes/Task Trackers

EXECUTION UNITS

MAPREDUCE

SOME DISTRIBUTIONS OF

APACHEHADOOP

Apache Foundation

Sandbox

Hortonworks

MAPREDUCE

PIG & HIVE

MAPREDUCE

• Java

• Write many lines of code

PIG

• Mostly used by Yahoo

• Most used for data processing

• Shares some constructs w/ SQL

• Is more Verbose• Needs a lot of training for users with

limited procedural programming background

• Offers control over the flow of data

HIVE

• Mostly used by Facebook for analytic purposes

• Used for analytics• Relatively easier for developers w/

SQL experience• Less control over optimization of data

flows compared to Pig

Not as efficient as MapReduceHigher productivity for data scientists and developers

THE

EXPLOSION OFHADOOP

17

THE

HISTORY OFSPARK

2006 2010 2013

2009 2011 20142004

MapReduce

Spark Paper

BSD Open Source

Top Level

Apache

SPARK

SHAREDLIBRARIES

SPARK

THE UNIFIED PLATFORMFOR BIG DATA

APIs for :• Scala• Java • Python• R

SparkSQL

SparkStreaming

MLlib(machinelearning)

GraphX(graph)

Spark Core

SPARK

BENEFITSPerformance

Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests).

Can be used for batch and real-time data processing.

Unified Engine

Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing.

A single application can combine all types of processing.

Developer Productivity

Easy-to-use APIs for processing large datasets.

Includes 100+ operators for transforming.

Ecosystem

Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB.

Runs on top of the Apache YARN resource manager.

ANALYTICS

CORTANA

SQL Server

Big Data Optimizations

APSSQL Server

Presenter

Presentation Notes

KEY POINT To convey that the major pillars of the Analytics Platform System with key points. To help organizations with a simple and smooth seamlessly transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS) – the only, no-compromise modern data warehouse solution that brings both Hadoop and RDBMS in a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility to all their users through some of the most widely used BI tools in the industry. Enterprise-ready Big Data Microsoft APS combines Microsoft’s industry leading RDBMS platform, the Parallel Data warehouse Appliance (PDW), with Microsoft’s Hadoop Distribution, HDInsight, for non-relational data to offer an all-in Big Data Analytics appliance. Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS. Your Modern Data Warehouse in One Turnkey Appliance: APS integrates PDW and HDInsight to operate seamlessly together in a single appliance Integrated Querying across All Data Types Using T-SQL: PolyBase allows Hadoop data to be queried using rich featured T-SQL , while taking advantage of Hadoop processing, without additional Hadoop-based skills or training. Enterprise-Ready Hadoop: HDInsight is Microsoft’s Hadoop-based distribution with end-user authentication via Active Directory and managed by IT using System Center Big Data Insights to Any User: Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel Next-generation performance at scale APS was built to scale into multi-petabytes, handling both RDBMS and the data stored in Hadoop, to deliver the performance that meets today’s near real-time sand rapid insights requirements. Scale-Out to accommodate your Growing Data: APS contains PDW and HDInsight that both have linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out Remove DW bottlenecks with MPP SQL Server: Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server. Real-Time Performance with In-Memory: Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore Concurrency that Supports High Adoption: Scales in simultaneous user accessibility. APS has high concurrency, allowing for multiple workloads. Optimal architecture More than just a converged system, PDW has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value: APS provides the industry’s lowest DW Price/TB Lower cost while maintaining performance using WS2012 Storage Spaces that replace SAN with economical Windows Storage Spaces Save up to 70% of PDW storage with up to 15x compression via updateable in-memory columnstore Value through single appliance solution Reduce hardware footprint by having PDW and HDInsight within a single appliance Remove the need for costly integration efforts Value through flexible hardware options Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta

Base Unit Scale UnitExtension Base Unit

SQL Server

APS Growth Topology

Presenter

Presentation Notes

HP example – Dell is slightly different

Azure SQL DWSQL Server

Presenter

Presentation Notes


Deployment options and hybrid solutionsSQL Server

Presenter

Presentation Notes


Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL

Uses the power of MPP to enhance query execution performance

Supports Windows Azure HDInsight to enable new hybrid cloud scenarios

Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera

SQL ServerParallel DataWarehouseMicrosoft Azure

HDInsight

PolyBase

Microsoft HDInsight

Hortonworks for Windows and Linux

Cloudera

Connecting Islands of Data with PolyBaseResult

setSelect

…

SQL Server

Presenter

Presentation Notes

KEY POINT PolyBase is available only within the Microsoft Analytics Platform System. TALK TRACK PolyBase simplifies this by allowing Hadoop data to be queried with standard Transact-SQL (T-SQL) query language without the need to learn MapReduce and without the need to move the data into the data warehouse. PolyBase unifies relational and non-relational data at the query level. Integrated query: PolyBase accepts a standard T-SQL query that joins tables containing a relational source with tables in a Hadoop cluster referencing a non-relational source. It then seamlessly returns the results to the user. PolyBase can query Hadoop data in other Hadoop distributions such as Hortonworks or Cloudera. No difficult learning curve: Standard T-SQL can be used to query Hadoop data. Users are not required to learn MapReduce to execute the query. Cloud-hybrid scenario options: PolyBase can also query across Windows Azure HDInsight, providing a Hybrid Cloud solution to the data warehouse

Use Case 3: Supply Chain Management

USE CASE: SUPPLY CHAIN MANAGEMENT

USE CASE:

SMART GRIDMANAGEMENT

USE CASES

SMART GRID

PREDICTIVEMAINTENANCE

DEMANDFORECASTING

GRIDOPTIMIZATION

THEFTPERVENTION

DEMANDRESPONSE

CUSTOMERPROFILING

USE CASES

SMART GRID

USE CASE:

REAL TIMETRAFFIC ANALYSIS

REAL TIME

TRAFFICANALYSIS

USE CASES

STREAMANALYTICS

Real-time frauddetection

Connected cars Click-stream analysis

Real-time financialportfolio alerts

Smart grid, energymanagement

CRM alerting sales to customer case

Data and identityprotection services

Real-time salestracking

ML PROBLEMS

SOLVED BYAZURE ML

Classification Regression RecommendersAnomalyDetection Clustering

INDUSTRY USE CASES

MACHINELEARNING

THE

MACHINE LEARNINGWORKFLOW

Input data

Data transformation

Split data

Score (prediction)

Evaluate Model

Define model

Train model

AZURE

DATAFACTORY

HD

INSIGHT

Orion [email protected]

Twitter: @oriongm

BIG DATA &

Advanced Analytics Roadshow

Questions?

Documents

BIG DATA PROCESSING A DEEP DIVE IN HADOOP ......SPARK BENEFITS Performance Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests). Can be used for