Modern Data Warehousing with the Microsoft Analytics Platform System

Microsoft Analytics Platform System (APS): Modern Data Warehousing

James Serra, Big Data Evangelist, Microsoft

Agenda
• Traditional data warehouse & modern data warehouse
• APS architecture
• Hadoop & PolyBase
• Performance and scale
• Appliance benefits
• Summary/questions

The traditional data warehouse


… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing. – Gartner, “The State of Data Warehousing in 2012”

Data sources (OLTP, ERP, CRM, LOB) → ETL → Data warehouse → BI and analytics

Will your current solution handle future needs?

How to “break” the traditional data warehouse


Data sources (OLTP, ERP, CRM, LOB) → ETL → Data warehouse → BI and analytics

1. Increasing data volumes
2. Real-time performance/data
3. New data sources & types: non-relational data (devices, web, sensors, social; IoT)
4. Cloud-born data

Modern data warehouse defined

INFRASTRUCTURE
DATA MANAGEMENT & PROCESSING: non-relational, relational, analytical, streaming, internal & external
DATA ENRICHMENT AND FEDERATED QUERY: extract, transform, load; single query model; data quality; master data management
BI & ANALYTICS: self-service, collaboration, corporate, predictive, mobile

Data sources: OLTP, ERP, CRM, LOB
Non-relational data: devices, web, sensors, social

Do you have any of these pain points?
• Are you using or going to use “Big Data” and/or “Hadoop”?
• No or limited access to detailed data; can only surface reports and cannot ask ad hoc questions.
• Slow data loading performance cannot keep up with the need for data from transactional systems for intraday reporting.
• MOLAP cube processing and data refresh take too long.
• Slow query performance with need for constant tuning, especially with SAN storage.
• High cost of SAN storage chargeback.

Roadblocks to evolving to a modern data warehouse

Solution and issue with that solution:
• Keep legacy investment: limited scalability & ability to handle new data types
• Buy new tier one hardware appliance: high acquisition/migration costs & no Hadoop
• Acquire big data solution (Hadoop): significant training & still siloed
• Acquire business intelligence solution: complex with low adoption

Introducing the Microsoft Analytics Platform System: your turnkey modern data warehouse appliance

Enterprise-ready big data
• Relational and non-relational data in a single appliance
• Or, integrate relational data with non-relational data in an external Hadoop cluster on premises or data stored in the cloud (hot, warm, cold)
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and APS using T-SQL (PolyBase)
• Direct integration with Microsoft BI tools such as Power BI

Next-generation performance at scale
• Near real-time performance with In-Memory
• Scale-out to accommodate your growing data or to increase performance (2 nodes to 56 nodes)
• Remove SMP DW bottlenecks with MPP SQL Server
• No rip and replace when more performance needed
• No performance tuning required
• Concurrency that fuels rapid adoption

Engineered for optimal value
• Industry’s lowest DW price/TB
• Value through a single appliance solution
• Value with flexible hardware options using commodity hardware
• Free up space on SAN (cost averages 10k per TB)

Hardware appliance vendor offerings

Hardware and software engineered together: the ease of an appliance
• Co-engineered with HP, Dell, and Quanta best practices
• Leading performance with commodity hardware
• Pre-configured, built, and tuned software and hardware
• Integrated support plan with a single Microsoft contact

[Diagram labels: PDW, HDInsight, PolyBase; social and web analytics; live data feeds; advanced analytics]

APS History
• DatAllegro started in 2003
• Microsoft acquires DatAllegro in September 2008
• PDW released in December 2010 (version 1)
• Version 2 made available in March 2013 (PolyBase introduced)
• AU1 released in April 2014. Renamed from Parallel Data Warehouse (PDW) to Analytics Platform System (APS). It still includes the PDW region as well as a new HDInsight/Hadoop region
• AU2 released in July 2014
• AU3 released in October 2014

There will be AU updates every 3-4 months.

NOTE: This is a Data Warehouse solution and not an OLTP (online transaction processing) solution.

Case studies: Go to https://customers.microsoft.com and enter "parallel data warehouse" (old name) in the keyword box and search the results, then enter "analytics platform system" (new name).

Parallelism

MPP - Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared nothing: each CPU has its own memory and disk (scale-out)
• Segments communicate using high-speed network between nodes

SMP - Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN

APS Logical Architecture (overview)

[Diagram: a control node and several compute nodes, each running SQL Server and DMS; every compute node has its own balanced storage]

Compute node: the “worker bee” of APS
• Runs SQL Server 2014 APS
• Contains a “slice” of each database
• CPU is saturated by storage

Control node: the “brains” of the APS
• Also runs SQL Server 2014 APS
• Holds a “shell” copy of each database (metadata, statistics, etc.)
• The “public face” of the appliance

Data Movement Services (DMS)
• Part of the “secret sauce” of APS
• Moves data around as needed
• Enables parallel operations among the compute nodes (queries, loads, etc.)

APS Logical Architecture (overview)

1) User connects to the appliance (control node) and submits query
2) Control node query processor determines best *parallel* query plan
3) DMS distributes sub-queries to each compute node
4) Each compute node executes query on its subset of data
5) Each compute node returns a subset of the response to the control node
6) If necessary, control node does any final aggregation/computation
7) Control node returns results to user

Queries run in parallel on a subset of the data, using separate pipes, effectively making the pipe larger.
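The seven steps can be sketched as a scatter-gather loop. This is only an illustration of the idea, not how APS or DMS is implemented: the data in `NODE_SLICES`, the function names, and the use of a thread pool to stand in for compute nodes are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of a sales table
# held by one compute node (product, quantity sold).
NODE_SLICES = [
    [("widget", 3), ("gadget", 5)],
    [("widget", 2), ("gizmo", 7)],
    [("gadget", 1), ("gizmo", 4), ("widget", 6)],
]


def run_subquery(rows):
    """Compute node (steps 4-5): aggregate quantity per product over the local slice only."""
    partial = {}
    for product, qty in rows:
        partial[product] = partial.get(product, 0) + qty
    return partial


def control_node_query(slices):
    """Control node (steps 3, 6, 7): fan out sub-queries, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(run_subquery, slices))
    total = {}
    for partial in partials:
        for product, qty in partial.items():
            total[product] = total.get(product, 0) + qty
    return total


result = control_node_query(NODE_SLICES)
print(result)
```

Each worker touches only its own slice, which mirrors the shared-nothing idea; the final merge corresponds to the optional aggregation in step 6.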

APS Data Layout Options

Star schema
• Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
• Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• Customer Dim: Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email
• Sales Fact: Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold

Replicated: table copied to each compute node (in the diagram, the dimension tables TD, PD, SD, and CD appear on every node)

Distributed: table spread across compute nodes based on “hash” (in the diagram, the Sales Fact table)

[Diagram: each compute node holds eight distributions of the fact table, FactSales_A through FactSales_H]

Data distribution

    CREATE TABLE FactSales
    (
        ProductKey         INT NOT NULL,
        OrderDateKey       INT NOT NULL,
        DueDateKey         INT NOT NULL,
        ShipDateKey        INT NOT NULL,
        ResellerKey        INT NOT NULL,
        EmployeeKey        INT NOT NULL,
        PromotionKey       INT NOT NULL,
        CurrencyKey        INT NOT NULL,
        SalesTerritoryKey  INT NOT NULL,
        SalesOrderNumber   VARCHAR(20) NOT NULL
    )
    WITH
    (
        -- Rows are assigned to distributions by hashing ProductKey
        DISTRIBUTION = HASH(ProductKey),
        CLUSTERED INDEX (OrderDateKey),
        -- Each listed value starts a new partition (RANGE RIGHT)
        PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20010601, 20010901))
    );

• Control node: create table metadata on the control node
• Control node: send CREATE TABLE SQL to each compute node (Compute Node 1, Compute Node 2, … Compute Node X): Create Table FactSales_A, Create Table FactSales_B, Create Table FactSales_C, … Create Table FactSales_H

[Diagram: every compute node ends up with eight tables, FactSales_A through FactSales_H]
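A rough sketch of what DISTRIBUTION = HASH(ProductKey) means for row placement follows. The deck does not describe the hash function APS uses or how distributions map to nodes, so `zlib.crc32`, the bucket-to-node mapping, and the helper names `place_row` and `distribute` are assumptions chosen purely for illustration. Only the eight tables per node (FactSales_A to FactSales_H) come from the slides.

```python
import zlib

DISTRIBUTIONS_PER_NODE = 8  # FactSales_A .. FactSales_H on the slides


def place_row(product_key, node_count):
    """Map a distribution-key value to (compute node number, physical table name)."""
    total = node_count * DISTRIBUTIONS_PER_NODE
    # crc32 is a stand-in hash (assumption); the real APS hash is not given in the deck.
    bucket = zlib.crc32(str(product_key).encode("utf-8")) % total
    node = bucket // DISTRIBUTIONS_PER_NODE + 1
    suffix = "ABCDEFGH"[bucket % DISTRIBUTIONS_PER_NODE]
    return node, f"FactSales_{suffix}"


def distribute(rows, node_count):
    """Group rows by the physical table they would land in."""
    placement = {}
    for row in rows:
        key = place_row(row["ProductKey"], node_count)
        placement.setdefault(key, []).append(row)
    return placement


# Hypothetical rows: 1,000 orders over 50 products, spread over 2 compute nodes.
rows = [{"ProductKey": k % 50, "SalesOrderNumber": f"SO{k}"} for k in range(1000)]
placement = distribute(rows, node_count=2)
for (node, table), bucket_rows in sorted(placement.items()):
    print(node, table, len(bucket_rows))
```

The property that matters is that every row with the same ProductKey lands in the same physical table, so joins and aggregations on that key can run locally on each node.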

APS: balanced across servers and within

Largest table: 600,000,000,000 rows
Randomly distributed across 40 compute nodes (5 racks): 15,000,000,000 rows per node
In each server, randomly distributed to 8 tables (so 320 total tables): 1,875,000,000 rows per table
Each partition (2 years of data partitioned by week, benefiting queries by date): 18,028,846 rows

As an end user or DBA you think about 1 table: LineItem.

“Select * from LineItem” is split into 320 queries running in parallel against 320 (1.875b row) tables.

“Select * from LineItem where OrderDate = ‘1/1/2014’” is 320 queries against 320 (18m row) tables.

You don’t care or need to know that there are actually 320 tables representing your 1 logical table.

A clustered columnstore index (CCI) can add further performance via segment elimination.
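The row counts on this slide follow from simple division. The short check below reproduces them, assuming an even spread of rows and 52 weekly partitions per year (both assumptions; the slide only states the totals).

```python
total_rows = 600_000_000_000   # largest table
compute_nodes = 40             # 5 racks
tables_per_node = 8            # distributions per server
weekly_partitions = 2 * 52     # 2 years partitioned by week (assumes 52 weeks/year)

rows_per_node = total_rows // compute_nodes
physical_tables = compute_nodes * tables_per_node
rows_per_table = rows_per_node // tables_per_node
rows_per_partition = rows_per_table // weekly_partitions

print(f"rows per node:      {rows_per_node:,}")       # 15,000,000,000
print(f"physical tables:    {physical_tables}")       # 320
print(f"rows per table:     {rows_per_table:,}")      # 1,875,000,000
print(f"rows per partition: {rows_per_partition:,}")  # 18,028,846
```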

HP Configuration

Rack 1
• Base unit (6U): redundant InfiniBand, redundant Ethernet, management & control node (active), rack failover node (passive)
• Base unit (7U): 2 HP 1U servers (16 cores each, 32 total) as Compute Node 1 and Compute Node 2; Microsoft Storage Spaces (5U) with 1TB drives; user data capacity 75TB
• Scale unit (7U): 2 HP 1U servers
• Customer space (8U): ETL servers (landing zone), backup servers, passive unit (additional spares)

Extension racks
• Extension base unit (5U): redundant InfiniBand, redundant Ethernet, rack failover node (passive)
• Extension base unit (7U): 2 HP 1U servers
• Customer space (9U): ETL servers (landing zone), backup servers, passive unit (additional spares)

Capacity by configuration (uncompressed)
• ¼ rack: 15TB
• ½ rack: 30TB
• Full rack: 60TB
• 1¼ rack: 75.5TB
• 1½ rack: 90.6TB
• 2 racks: 120.8TB
• 3 racks: 181.2TB

Details
• 2 to 56 compute nodes (32 to 896 cores)
• 1 to 7 racks
• 1, 2, or 3 TB drives
• 15TB to 1.2PB uncompressed
• 75TB to 6PB user data (5:1)
• Up to 7 spare nodes available across the entire appliance
• Dual InfiniBand: 56Gbps
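The core counts and user-data figures in the HP configuration are consistent with two constants from the slides: 16 cores per compute node and a 5:1 compression ratio. The sketch below checks that arithmetic; the function names are made up for the example, and treating 1.2PB as 1,200TB is an assumption.

```python
CORES_PER_NODE = 16    # "16 Cores/Ea." on the base unit slide
COMPRESSION_RATIO = 5  # "User data (5:1)"


def total_cores(compute_nodes):
    """Cores across all compute nodes in the appliance."""
    return compute_nodes * CORES_PER_NODE


def user_data_tb(uncompressed_tb):
    """User data capacity implied by the stated 5:1 compression ratio."""
    return uncompressed_tb * COMPRESSION_RATIO


print(total_cores(2), total_cores(56))       # 32 896
print(user_data_tb(15), user_data_tb(1200))  # 75 6000 (that is, 75TB and 6PB)
```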





Advanced Analytics Defined

Analytics example

Descriptive: How many of our customers left in the last month? How many of these customers were profitable?

Diagnostic: Why did these profitable customers leave?

Predictive: How many profitable customers are likely to leave next month?

Prescriptive: How can we reduce this profitable customer churn rate?

What is Hadoop?

Microsoft Confidential


• Distributed, scalable system on commodity HW
• Composed of a few parts:
  • HDFS: distributed file system
  • MapReduce: programming model
  • Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm
• Main players are Hortonworks, Cloudera, MapR
• WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)

[Diagram: Hadoop stack with core services, data services, and operational services. Components shown: HDFS, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon, Oozie, Ambari, and load & extract tools (Sqoop, Flume, NFS, WebHDFS).]

[Diagram: a Hadoop cluster as a row of compute & storage nodes]

Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware

Complex query and analysis with big data today: steep learning curve, slow and inefficient
• Move HDFS into the warehouse before analysis: HDFS (Hadoop) → ETL → warehouse
• Learn new skills (T-SQL → Hadoop ecosystem)
• Build, integrate, manage, maintain, support
• “New” data sources: devices, web, sensor, social

APS delivers enterprise-ready Hadoop with HDInsight: manageable, secured, and highly available Hadoop integrated into the appliance
• High performance tuned within the appliance
• End-user authentication with Active Directory
• Accessible insights for everyone with Microsoft BI tools
• Managed and monitored using System Center
• 100% Apache Hadoop
• Leverage your existing T-SQL skills

Additional features over a separate Hadoop cluster, plus still one support contact!

[Diagram: SQL Server Parallel Data Warehouse and Microsoft HDInsight connected by PolyBase]

APS appliance overview

[Diagram: the appliance consists of hardware, fabric, and two regions: the Parallel Data Warehouse region and the HDInsight region]

• A region is a logical container within an appliance
• Each workload contains the following boundaries: security, metering, servicing

Query Hadoop data with T-SQL using PolyBase: bringing the worlds of big data and the data warehouse together for users and IT
• Provides a single T-SQL query model (“semantic layer”) for APS and Hadoop with rich features of T-SQL, including joins without ETL
• Uses the power of MPP to enhance query execution performance
• Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
• Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
• Use existing SQL skillset, no IT intervention

[Diagram: a SELECT goes in and a result set comes back. PolyBase connects SQL Server Parallel Data Warehouse to Microsoft HDInsight (HDP 2.0), Cloudera CDH 5.1 (Linux), Hortonworks HDP 2.2 (Windows, Linux), and Windows Azure HDInsight (HDP 2.2) (WASB). Caption: query relational + non-relational.]

Others (SQL Server, DB2, Oracle)? True federated query engine

Use cases where PolyBase simplifies using Hadoop data
• Bringing islands of Hadoop data together
• High performance queries against Hadoop data (predicate pushdown)
• Archiving data warehouse data to Hadoop (move): Hadoop as cold storage
• Exporting relational data to Hadoop (copy): Hadoop as backup/DR, analysis, cloud use
• Importing Hadoop data into data warehouse (copy): Hadoop as staging area, sandbox, data lake
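Predicate pushdown, mentioned in the use cases, means the filter runs where the data lives so that fewer rows cross the network. The toy comparison below illustrates only that idea; it does not use or model the PolyBase API, and `hadoop_rows` and both scan functions are hypothetical.

```python
# Hypothetical rows stored on the Hadoop side: 10,000 sensor readings from 10 sensors.
hadoop_rows = [{"sensor_id": i % 10, "reading": i} for i in range(10_000)]


def scan_without_pushdown(rows, predicate):
    """Ship every row to the warehouse, then filter there."""
    transferred = list(rows)  # all rows cross the network
    return [r for r in transferred if predicate(r)], len(transferred)


def scan_with_pushdown(rows, predicate):
    """Filter on the Hadoop side, ship only the matching rows."""
    transferred = [r for r in rows if predicate(r)]
    return transferred, len(transferred)


def only_sensor_3(row):
    return row["sensor_id"] == 3


slow_result, slow_moved = scan_without_pushdown(hadoop_rows, only_sensor_3)
fast_result, fast_moved = scan_with_pushdown(hadoop_rows, only_sensor_3)
print(slow_moved, fast_moved)  # 10000 1000
```

Both scans return the same rows; the difference is how many rows had to be moved to produce them.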

Big data insights for anyone: native Microsoft BI integration to create new insights with familiar tools
• Tools like Power BI minimize IT intervention for discovering data
• T-SQL for DBAs and power users to join relational and Hadoop data
• Hadoop tools like MapReduce, Hive, and Pig for data scientists
• Leverages high adoption of Excel, Power View, Power Pivot, and SSAS

Audiences: power users, data scientists, and everyone else using Microsoft BI tools





Scaling out relational data to petabytes: scale-out technologies in the Analytics Platform System
• Scale-out Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven)
• Multiple nodes with dedicated CPU, memory, storage (“shared-nothing”)
• Incrementally add HW for near-linear scale to multi-PB (no need to delete older data, stage)
• Handles query complexity and concurrency at scale
• No “forklift” of prior warehouse to increase capacity
• Start small with a few-terabyte warehouse
• Mixed workload support: query while you load (250GB/hour per node). No need for maintenance window

[Diagram: scale from 0TB to 6PB, starting with PDW and adding PDW or HDInsight blocks]
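As a back-of-the-envelope illustration of the 250GB/hour per node load rate, the sketch below estimates load time for a given data volume. It assumes load throughput scales perfectly linearly with node count, which the deck only describes as "near-linear", so treat the result as an optimistic estimate. The function name and the 10TB example are hypothetical.

```python
LOAD_RATE_GB_PER_HOUR_PER_NODE = 250  # figure quoted on the slide


def estimated_load_hours(data_gb, compute_nodes):
    """Hours to load data_gb, assuming perfectly linear scaling across nodes."""
    if compute_nodes <= 0:
        raise ValueError("compute_nodes must be positive")
    return data_gb / (LOAD_RATE_GB_PER_HOUR_PER_NODE * compute_nodes)


# Hypothetical example: loading 10 TB (10,000 GB) on 2 nodes vs 8 nodes.
print(estimated_load_hours(10_000, 2))  # 20.0
print(estimated_load_hours(10_000, 8))  # 5.0
```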

Blazing fast performance: MPP and in-memory columnstore for next-generation performance
• Store data in columnar format for massive compression
• Load data into or out of memory for next-generation performance
• Updateable and clustered for real-time trickle loading
• No secondary indexes required

Updatable clustered columnstore vs. table with customary indexing:
• Up to 100x faster queries
• Up to 15x more compression

[Diagram: columnstore index representation (columns C1 to C6 stored separately) and parallel query execution (query in, results out)]
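Two ideas behind columnstore performance can be shown in a few lines: storing each column separately makes repeated values easy to compress, and per-segment min/max metadata lets a query skip segments that cannot match (segment elimination). This is a simplified sketch with made-up data and a tiny segment size, not the SQL Server columnstore format.

```python
from itertools import groupby

# Hypothetical fact rows (date_key, region, qty), sorted by date_key.
ROWS = [(20140101 + i // 4, "East" if i % 8 < 4 else "West", i) for i in range(16)]
SEGMENT_SIZE = 4  # real segments are far larger; 4 keeps the example readable


def run_length_encode(column):
    """Compress a column into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


def build_segments(rows, size):
    """Split rows into segments, store each column separately plus min/max of date_key."""
    segments = []
    for start in range(0, len(rows), size):
        dates, regions, qtys = (list(col) for col in zip(*rows[start:start + size]))
        segments.append({"min": min(dates), "max": max(dates),
                         "date": dates, "region": regions, "qty": qtys})
    return segments


def sum_qty_for_date(segments, date_key):
    """Sum qty for one date, skipping segments whose min/max range excludes it."""
    total, scanned = 0, 0
    for seg in segments:
        if not seg["min"] <= date_key <= seg["max"]:
            continue  # segment eliminated without reading its columns
        scanned += 1
        total += sum(q for d, q in zip(seg["date"], seg["qty"]) if d == date_key)
    return total, scanned


date_column = [row[0] for row in ROWS]
print(run_length_encode(date_column))  # 16 values stored as 4 (value, count) pairs
segments = build_segments(ROWS, SEGMENT_SIZE)
print(sum_qty_for_date(segments, 20140103))  # (38, 1): only 1 of 4 segments scanned
```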

Investment firm before/after results: HP SMP vs APS
• 21x improvement loading data (7:30 minutes vs 21 seconds)
• 62x improvement staging to landing (30 minutes vs 29 seconds)
• 17x, 166x, 169x query performance improvement (1:05 hour vs 23 seconds)
• 46x improvement creating datamart (70 minutes vs 1:31 minutes)
• 1.1 TB/hr loading time, 8.8x compression (2 billion rows) (472GB to 53GB)
• Microsoft BI tools work unchanged
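The speedup factors quoted for the investment firm can be reproduced from the before/after times. The check below assumes "7:30 minutes" means 7 minutes 30 seconds, "1:05 hour" means 1 hour 5 minutes, and "1:31 minutes" means 1 minute 31 seconds; those readings are an interpretation of the slide, not something it states.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


# (before on SMP, after on APS), in seconds
cases = {
    "loading data": (seconds(minutes=7, secs=30), seconds(secs=21)),
    "staging to landing": (seconds(minutes=30), seconds(secs=29)),
    "query performance": (seconds(hours=1, minutes=5), seconds(secs=23)),
    "creating datamart": (seconds(minutes=70), seconds(minutes=1, secs=31)),
}

speedups = {name: before / after for name, (before, after) in cases.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Under those readings the ratios truncate to 21x, 62x, 169x, and 46x, matching the figures on the slide.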

Concurrency that fuels rapid adoption: great performance with mixed workloads

[Diagram: ERP, CRM, and LOB apps and Hadoop / big data load into the Analytics Platform System (PDW, HDInsight, PolyBase) through ETL/ELT with SSIS, DQS, MDS, and DWLoader. Labels on the load and query paths: intra-day, near real-time, real-time, fast ad hoc, ad hoc queries, columnstore, PolyBase, CRTAS, “link table”, ROLAP / MOLAP, DirectQuery, SNAC. Consumers: SQL Server SMP (spoke), reporting and cubes, BI tools.]

Example overall data flow and architecture

[Diagram: event & data producers (web logs; IoT, mobile devices, etc.; social data) → ingest and transform (Event Hubs, Stream Analytics, HDInsight, Azure Data Factory) → DW / long-term storage (Azure SQL DB, Azure Blob Storage, Analytics Platform System) and predictive analytics (Azure Machine Learning: fraud detection, etc.) → present & decide (Power BI, web dashboards, mobile devices)]





APS provides the industry’s lowest DW appliance price/TB: reshaped hardware specs through software innovation
• Significantly lower price per TB than the closest competitor
• Lower storage costs with Windows Server 2012 Storage Spaces
• Small cost gap between multiple clustered HP DL980s with SAN vs APS 1/4 rack

[Chart: price per terabyte for leading vendors (Sept 2014), shown as TCO per TB (uncompressed) for Oracle, Pivotal, IBM, Teradata, and Microsoft; axis from $0 to $140,000]

Virtualized architecture overview

[Diagram: Host 1 to Host 4 connected by InfiniBand and Ethernet, with economical disk storage on direct attached SAS. Virtual machines shown: CTL, MAD, AD, and VMM on the base unit, plus Compute 1 and Compute 2.]

• APS engine
• DMS Manager
• SQL Server 2012 Enterprise Edition (APS build) (AU3: SQL 2014)

Software details
• All hosts run Windows Server 2012 Standard (AU3: 2012 R2) and Windows Azure Virtual Machines
• Fabric or workload in Hyper-V virtual machines
• Fabric virtual machine, management server (MAD01), and control server (CTL) share one server
• APS agent that runs on all hosts and all virtual machines
• DWConfig and Admin Console
• Windows Storage Spaces and Azure Storage blobs
• Does not require expertise in Hyper-V or Windows

APS High-Availability
• No single point of failure
• No need for SQL Server clustering

[Diagram: control host, failover host, Compute Host 1, and Compute Host 2 on redundant InfiniBand 1/2 and Ethernet 1/2 networks, running the FAB, AD, VMM, MAD, CTL, Compute 1, and Compute 2 virtual machines; failed hosts and networks are marked with an X]

Less DBA maintenance/monitoring
• No index creation
• No deleting/archiving data to save space
• Management simplicity (System Center, Admin console, DMVs)
• No blocking
• No logs
• No query hints
• No wait states
• No IO tuning
• No query optimization/tuning
• No index reorgs/rebuilds
• No partitioning
• No managing filegroups
• No shrinking/expanding databases
• No managing physical servers
• No patching servers and software

RESULT: DBAs spend more of their time as architects and not babysitters!

The no-compromise modern data warehouse solution: Microsoft’s turnkey modern data warehouse appliance, the Analytics Platform System

Summary of Benefits
• Improved query performance
• Faster data loading
• Improved concurrency
• Less DBA maintenance
• Limited training needed
• Use familiar BI tools
• Ease of appliance deployment
• Mixed workload support
• Improved data compression
• Scalability
• High availability
• PolyBase
• Integration with cloud-born data
• HDInsight/Hadoop integration
• Data warehouse consolidation
• Easy support model

Bold = benefits of APS over upgrading to SQL Server 2014, no worry about future hardware roadblocks

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Questions?

James Serra, [email protected]

Blog about PDW topics: http://www.jamesserra.com/archive/category/pdw/


Analytics Platform System: APS Appliance Update 4 and APS Appliance Update 5 (H1 CY2015 to H2 CY2015)

Microsoft Confidential. Preliminary information. Dates and capabilities subject to change. Microsoft makes no warranties, express or implied.

Enterprise-ready big data, cloud enabled
• Improved PolyBase support
  • Cloudera 5.1 support
  • Partial aggregate pushdowns
• Expanding big data capacity
  • Grow HDInsight region on an appliance with an existing region

Next-gen performance & engineered for optimal value
• 1.5X data return rate for SELECT * queries
• Streaming large data sets for external apps (e.g., SSAS, SAS, R, etc.)

Next-gen performance & engineered for optimal value
• T-SQL compatibility
  • Scalar UDFs (CREATE FUNCTION)
  • SQL Server SMP to APS (SQL Server MPP) migration utility
  • Bulk load / BCP through SQL Server command-line tools
• OEM hardware refresh (HP Gen9)
  • HP ProLiant DL360 Gen9 server w/2x Intel Haswell processors, 256 GB (16x16GB) 2133MHz memory
  • HP 5900 series switches (HA improvements)

Symmetry between DW on-prem and Azure
• Backup from SQL Server/APS
• Hybrid APS to Azure disaster recovery

T-SQL compat: reduced friction DW upsizing from SQL Server to APS

Appliance hardware
• Heterogeneous server hardware generation support (e.g. mixed racks)
• PolyBase (Parquet support, string filter pushdown to Hadoop, MapR support, Kerberos support)