Modern Data Warehousing

  • Published on
    08-Sep-2014

DESCRIPTION

The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data. How can you prevent this from happening? Enter the modern data warehouse, which is able to handle and excel with these new trends. It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle big data and provide fast queries. Is there one appliance that can support this modern data warehouse? Yes! It is the Parallel Data Warehouse (PDW) from Microsoft, which is a Massively Parallel Processing (MPP) appliance that has been recently updated (v2 AU1). In this session I will dig into the details of the modern data warehouse and PDW. I will give an overview of the PDW hardware and software architecture, identify what makes PDW different, and demonstrate the increased performance. In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.

Transcript

  • Modern Data Warehousing: Insights on Any Data of Any Size
    James Serra, Microsoft PDW Technology Solution Professional
    JamesSerra3@gmail.com
    JamesSerra.com
  • About Me
    Business Intelligence Consultant, in IT for 28 years
    Microsoft, PDW Technology Solution Professional (TSP)
    Owner of Serra Consulting Services, specializing in end-to-end Business Intelligence and Data Warehouse solutions using the Microsoft BI stack
    Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer
    Been perm, contractor, consultant, business owner
    Presenter at PASS Business Analytics Conference and PASS Summit
    MCSE for SQL Server 2012: Data Platform and BI
    SME for SQL Server 2012 certs
    Contributing writer for SQL Server Pro magazine
    Blog at JamesSerra.com
    SQL Server MVP
    Author of book Reporting with Microsoft SQL Server 2012
  • Agenda
    Traditional data warehouse & modern data warehouse
    APS architecture
    Hadoop & PolyBase
    Performance and scale
    Appliance benefits
    Summarize/questions
  • Data sources: Will your current solution handle future needs?
  • Data sources: Non-relational data
  • Roadblocks to evolving to a modern data warehouse (solution, and the issue with that solution)
    Keep legacy investment: limited scalability & ability to handle new data
    Buy new tier one hardware appliance: high acquisition/migration costs & no Hadoop
    Acquire big data solution (Hadoop): significant training & still siloed
    Acquire business intelligence solution: complex with low adoption
  • Introducing the Microsoft Analytics Platform System: your turnkey modern data warehouse appliance
    Relational and non-relational data in a single appliance
    Enterprise-ready Hadoop
    Integrated querying across Hadoop and APS using T-SQL
    Direct integration with Microsoft BI tools such as Power BI
    Near real-time performance with In-Memory
    Scale-out to accommodate your growing data
    Remove DW bottlenecks with MPP SQL Server
    Concurrency that fuels rapid adoption
    Industry's lowest DW price/TB
    Value through a single appliance solution
    Value with flexible hardware options using commodity hardware
    Free up space on SAN
  • Hardware and software engineered together: the ease of an appliance
    Co-engineered with HP, Dell, and Quanta best practices
    Leading performance with commodity hardware
    Pre-configured, built, and tuned software and hardware
    Integrated support plan with a single Microsoft contact
    [Diagram labels: PDW, HDInsight, PolyBase]
  • APS Architecture
    Microsoft Analytics Platform System (APS), formerly known by its code name Project Madison, was first released in December 2010 (version 1) as SQL Server Parallel Data Warehouse (PDW).
    PDW is Microsoft's reworking of the DatAllegro Inc. massively parallel processing (MPP) product, which was started in 2003 and which Microsoft acquired in September 2008.
    Version 2 of PDW was made available in March 2013.
    The product was renamed from SQL Server Parallel Data Warehouse (PDW) to Analytics Platform System (APS) in April 2014; it still includes the PDW region as well as a new HDInsight/Hadoop region.
    PolyBase was introduced with version 2 of PDW and gained new features in PDW v2 AU1 (April 2014).
    Case studies: http://www.microsoft.com/casestudies/Case_Study_Search_Results.aspx?Type=1&Keywords=%22Parallel%20Data%20Warehouse%22&LangID=46
  • APS Logical Architecture (overview)
    Compute node, the "worker bee" of APS: runs SQL Server 2012 APS; contains a slice of each database
    Control node, the "brains" of the APS: also runs SQL Server 2012 APS; holds a shell copy of each database (metadata, statistics, etc.); the public face of the appliance
    Data Movement Services (DMS), part of the "secret sauce" of APS: moves data around as needed; enables parallel operations among the compute nodes (queries, loads, etc.)
    [Diagram: one control node (SQL, DMS) and four compute nodes, each with SQL, DMS, and balanced storage]
  • APS Logical Architecture (querying)
    1) User connects to the appliance (control node) and submits query
    2) Control node query processor determines best *parallel* query plan
    3) APS distributes sub-queries to each compute node
    4) Each compute node executes query on its subset of data
    5) Each compute node returns a subset of the response to the control node
    6) If necessary, control node does any final aggregation/computation
    7) Control node returns results to user
    [Diagram: one control node (SQL, DMS) and four compute nodes, each with SQL, DMS, and balanced storage]
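The seven querying steps can be imitated in a few lines of Python. This is only a sketch of the scatter/gather pattern, not APS code: the sample rows, the round-robin placement, and the function names are all assumptions made for illustration, and the "compute nodes" here run one after another rather than in parallel.

```python
from collections import defaultdict

# Hypothetical sales rows: (store_id, dollars_sold). Not taken from the slides.
ROWS = [(1, 10.0), (2, 5.0), (1, 7.5), (3, 2.5),
        (2, 1.0), (3, 4.0), (1, 1.5), (2, 3.0)]
NUM_COMPUTE_NODES = 4


def distribute(rows, num_nodes):
    """Give each compute node a slice of the data (round-robin, for illustration only)."""
    slices = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        slices[i % num_nodes].append(row)
    return slices


def compute_node_subquery(node_rows):
    """Step 4: a compute node aggregates only its own subset of the data."""
    partial = defaultdict(float)
    for store_id, dollars in node_rows:
        partial[store_id] += dollars
    return dict(partial)


def control_node_query(slices):
    """Steps 3, 5 and 6: send sub-queries out, collect partial results, finish the aggregation."""
    partials = [compute_node_subquery(s) for s in slices]  # sequential stand-in for parallel work
    final = defaultdict(float)
    for partial in partials:
        for store_id, subtotal in partial.items():
            final[store_id] += subtotal
    return dict(final)


slices = distribute(ROWS, NUM_COMPUTE_NODES)
totals = control_node_query(slices)  # step 7: the result handed back to the user
print(totals)
```

The point of the pattern is that each node scans only its own slice, so the control node merges a handful of partial totals instead of reading every row itself.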
  • APS Data Layout Options
    Replicated table: copied to each compute node
    Distributed table: spread across compute nodes based on hash
    Star schema:
    Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
    Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
    Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
    Customer Dim: Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email
    Sales Fact: Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold
    [Diagram: four compute nodes; each holds a copy of the time (TD), product (PD), store (SD), and customer (CD) dimensions, while the Sales Fact table is spread across the nodes]
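Why the two layout options work well together can be shown with a small Python sketch. It assumes a modulo on the customer key as a stand-in for the appliance's hash function (the real function is not described in the slides), and the table contents are invented. Because every node holds a full copy of the small dimension, each node can join its fact slice locally without moving rows between nodes.

```python
NUM_NODES = 4

# Hypothetical star-schema data; names and values are illustrative only.
store_dim = {1: "Downtown", 2: "Airport", 3: "Mall"}  # small dimension table
sales_fact = [  # (store_dim_id, cust_dim_id, dollars_sold)
    (1, 101, 20.0), (2, 102, 35.0), (3, 103, 12.0),
    (1, 104, 8.0), (2, 105, 15.0), (3, 106, 40.0),
]


def node_for(key, num_nodes=NUM_NODES):
    """Assumed stand-in for the hash function: modulo on an integer key."""
    return key % num_nodes


# Replicated table: every compute node keeps a full copy of the dimension.
replicated_store_dim = [dict(store_dim) for _ in range(NUM_NODES)]

# Distributed table: each fact row lands on exactly one node, chosen by hash of cust_dim_id.
distributed_sales = [[] for _ in range(NUM_NODES)]
for row in sales_fact:
    distributed_sales[node_for(row[1])].append(row)


def local_join(node_id):
    """Join this node's fact slice to its local dimension copy; no rows cross nodes."""
    dim = replicated_store_dim[node_id]
    return [(dim[store_id], dollars)
            for store_id, _cust_id, dollars in distributed_sales[node_id]]


joined = [pair for node_id in range(NUM_NODES) for pair in local_join(node_id)]
print(sorted(joined))
```

Replicating a large table would waste storage on every node, which is why the slide reserves replication for dimensions and hash distribution for the fact table.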
  • Data distribution
    -- Fact table hash-distributed on ProductKey and partitioned by order date
    CREATE TABLE FactSales
    (
        ProductKey INT NOT NULL,
        OrderDateKey INT NOT NULL,
        DueDateKey INT NOT NULL,
        ShipDateKey INT NOT NULL,
        ResellerKey INT NOT NULL,
        EmployeeKey INT NOT NULL,
        PromotionKey INT NOT NULL,
        CurrencyKey INT NOT NULL,
        SalesTerritoryKey INT NOT NULL,
        SalesOrderNumber VARCHAR(20) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(ProductKey),  -- rows are assigned to a distribution by hash of this column
        CLUSTERED INDEX(OrderDateKey),
        PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20010601, 20010901))
    );
    Control node: sends the CREATE TABLE SQL to each compute node (Compute Node 1, Compute Node 2, ... Compute Node X) and creates the table metadata on the control node
    Each compute node: runs Create Table FactSales_A, Create Table FactSales_B, Create Table FactSales_C, ... Create Table FactSales_H
    [Diagram: every compute node holds eight physical tables, FactSales_A through FactSales_H]
  • APS: balanced across servers and within
    Largest table: 600,000,000,000
    Randomly distributed across 40 compute nodes (5 racks): 15,000,000,000
    In each server, randomly distributed to 8 tables: 1,875,000,000
    Each partition (2 years of data, partitioned by week): 18,028,846
    As an end user or DBA you think about 1 table: LineItem. You run select * from LineItem. APS is an appliance, simple to use! You don't care or need to know that there are actually 320 tables representing your 1 logical table.
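The figures on this slide follow from plain division, which the short Python check below reproduces. Two things are assumptions rather than statements from the slide: that the counts are rows, and that "2 years partitioned by week" means 104 partitions (2 x 52), which is the value that matches the slide's last number.

```python
total_rows = 600_000_000_000   # largest table (treating the count as rows is an assumption)
compute_nodes = 40             # 5 racks
tables_per_node = 8            # physical tables per server
weekly_partitions = 2 * 52     # assumed: 2 years of data, one partition per week

rows_per_node = total_rows // compute_nodes
rows_per_table = rows_per_node // tables_per_node
rows_per_partition = rows_per_table // weekly_partitions  # rounded down
physical_tables = compute_nodes * tables_per_node

print(f"{rows_per_node:,} per node")
print(f"{rows_per_table:,} per physical table")
print(f"{rows_per_partition:,} per partition")
print(f"{physical_tables} physical tables behind one logical table")
```

Each query therefore works on partitions of roughly 18 million rows rather than on one 600-billion-row table, while the user still sees a single LineItem table.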
  • Hardware configurations
    1/4 rack: 15 TB (raw); 1/2 rack: 30 TB (raw); full rack: 60 TB (raw)
    1 rack: 75.5 TB (raw); 1 1/2 rack: 90.6 TB (raw); 2 rack: 120.8 TB (raw); 3 rack: 181.2 TB (uncompressed)
    2 to 56 compute nodes (32 to 896 cores)
    1 to 7 racks
    1, 2, or 3 TB drives
    15 TB to 1.2 PB uncompressed
    75 TB to 6 PB user data (5:1)
    Up to 7 spare nodes available across the entire appliance
    Dual InfiniBand: 56 Gbps
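The capacity and core figures on this slide are consistent with each other, as the small Python check below shows. It reads "5:1" as a compression ratio applied to the uncompressed capacity and uses decimal units (1 PB = 1,000 TB); both readings are assumptions, since the slide does not spell them out.

```python
TB_PER_PB = 1000                  # decimal units, an assumption

compression_ratio = 5             # the "5:1" on the slide
uncompressed_tb = (15, 1200)      # 15 TB to 1.2 PB
user_data_tb = tuple(size * compression_ratio for size in uncompressed_tb)

# Smallest and largest appliance: (compute nodes, cores)
smallest = (2, 32)
largest = (56, 896)
cores_per_node = {cores // nodes for nodes, cores in (smallest, largest)}

print(f"user data: {user_data_tb[0]} TB to {user_data_tb[1] / TB_PER_PB:g} PB")
print(f"cores per compute node: {cores_per_node}")
```

Both ends of the range work out to 16 cores per compute node, so adding nodes scales compute and storage together.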
  • What is big data and why is it valuable to the business?
    An evolution in the nature and use of data in the enterprise
    [Chart labels: data complexity (variety and velocity), petabytes/volume, value to the business; stages: historical analysis, insight analysis, predictive analytics, predictive forecasting]
  • Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
    [Diagram labels: core services, operational services, data services; HDFS, Sqoop, Flume, NFS, load & extract, WebHDFS, Oozie, Ambari, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon; Hadoop cluster of compute & storage nodes]
  • Complex query and analysis with big data today: steep learning curve, slow and inefficient
    Move HDFS into the warehouse before analysis
    Learn new skills
    Build, integrate, manage, maintain, support
    [Diagram labels: ETL, T-SQL, Hadoop ecosystem, new data sources]
  • APS delivers enterprise-ready Hadoop with HDInsight
    Manageable, secured, and highly available Hadoop integrated into the appliance
    High performance tuned within the appliance
    End-user authentication with Active Directory
    Accessible insights for everyone with Microsoft BI tools
    Managed and monitored using System Center
    100% Apache Hadoop
    Leverage your existing T-SQL skills
    [Diagram labels: SQL Server Parallel Data Warehouse, Microsoft HDInsight, PolyBase]
  • APS appliance overview
    A region is a logical container within an appliance
    Each workload contains the following boundaries: security, metering, servicing
    [Diagram labels: Parallel Data Warehouse workload, HDInsight workload, fabric, hardware, appliance]
  • Query Hadoop data with T-SQL using PolyBase: bringing the worlds of big data and the data warehouse together for users and IT
    Provides a single T-SQL query model (semantic layer) for APS and Hadoop with rich features of T-SQL, including joins without ETL
    Uses the power of MPP to enhance query execution performance
    Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
    Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
    Use existing SQL skill set, no IT intervention
    [Diagram: a Select sent to SQL Server Parallel Data Warehouse returns a result set; PolyBase provides federated querying over Microsoft HDInsight (HDP 1.3), Hortonworks HDP 2.0 (Windows, Linux), Cloudera CDH 4.3 (Linux), Windows Azure HDInsight, and others; new in AU1: Windows Azure storage blob (WASB)]
  • Use cases where PolyBase simplifies using Hadoop data
    Bringing islands of Hadoop data together
    High-performance queries against Hadoop data (predicate pushdown)
    Archiving data warehouse data to Hadoop (move): Hadoop as cold storage
    Exporting relational data to Hadoop (copy): Hadoop as backup/DR, analysis, cloud use
    Importing Hadoop data into data warehouse (copy): Hadoop as staging area
  • Big data insights for anyone
    Native Microsoft BI integration to create new insights with familiar tools
    Tools like Power BI minimize IT intervention for discovering data
    T-SQL for DBA and power users to join relational and Hadoop data
    Hadoop tools like MapReduce, Hive, and Pig for data scientists
    Leverages high adoption of Excel, Power View, Power Pivot, and SSAS
    [Diagram audiences: power users, data scientists, everyone else using Microsoft BI tools]
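Predicate pushdown, named in the use cases above, is easiest to see by counting how many rows have to leave Hadoop. The Python sketch below only models that idea; it does not show how PolyBase implements pushdown, and the data, the predicate, and the function names are hypothetical.

```python
# Hypothetical rows stored in Hadoop: (sensor_id, reading).
hadoop_rows = [(i % 5, i * 1.5) for i in range(1000)]


def predicate(row):
    """The query's WHERE clause: keep only sensor 3."""
    return row[0] == 3


def query_without_pushdown(rows, pred):
    """Every row is moved to the warehouse first, then filtered there."""
    transferred = list(rows)
    result = [row for row in transferred if pred(row)]
    return result, len(transferred)


def query_with_pushdown(rows, pred):
    """The filter runs on the Hadoop side, so only matching rows are moved."""
    transferred = [row for row in rows if pred(row)]
    return transferred, len(transferred)


result_plain, moved_plain = query_without_pushdown(hadoop_rows, predicate)
result_pushed, moved_pushed = query_with_pushdown(hadoop_rows, predicate)

print(f"rows moved without pushdown: {moved_plain}")
print(f"rows moved with pushdown:    {moved_pushed}")
```

Both variants return the same result; the difference is the amount of data that crosses between the two systems, which is what drives the performance claim on the slide.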
  • Performance limitations and scale with a traditional data warehouse
    Scale up: diminishing scale as requirements grow
    Rowstore: sub-optimal performance for many data warehouse queries
    [Diagram: querying data by row; rows R1 to R6 with columns C1 to C4 are stored row by row across data pages 1 to 3; the scale-up steps are labeled "Forklift"]
  • Scale-out technologies in the Analytics Platform System
    Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven)
    Multiple nodes with dedicated CPU, memory, and storage (shared-nothing)
    Incrementally add hardware for near-linear scale to multi-PB (no need to delete older data, stage)
    Handles query complexity and concurrency at scale
    No forklift of prior warehouse to increase capacity
    Start small with a few-terabyte warehouse
    [Diagram: scale-out from 0 TB to 6 PB by adding PDW/HDInsight units]
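The rowstore limitation can be illustrated by counting how many values a query touches. The Python sketch below mirrors the 6-row by 4-column grid from the slide with made-up values and sums a single column, first from a row layout and then from a column layout of the same data. It is a toy model of the storage idea only, and it ignores pages, compression, and indexes.

```python
# 6 rows x 4 columns, like the R1..R6 / C1..C4 grid on the slide; values are made up.
rows = [(r, r * 10, r * 100, r * 1000) for r in range(1, 7)]


def sum_c3_rowstore(rows):
    """Row layout: whole rows are read, so all four column values are touched per row."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)
        total += row[2]
    return total, values_read


# Column layout: the same data pivoted so that each column is stored together.
columns = [list(col) for col in zip(*rows)]


def sum_c3_columnstore(columns):
    """Column layout: only the one column the query needs is read."""
    c3 = columns[2]
    return sum(c3), len(c3)


row_total, row_values_read = sum_c3_rowstore(rows)
col_total, col_values_read = sum_c3_columnstore(columns)

print(f"row layout:    total={row_total}, values read={row_values_read}")
print(f"column layout: total={col_total}, values read={col_values_read}")
```

Data warehouse queries often aggregate a few columns over many rows, which is the access pattern where reading whole rows wastes the most work.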
  • Store data in columnar format for massive compression Load data into or out of mem...