SQL Server 2016 PolyBase
Henk van der Valk
Oct.15, 2016
Level: Beginner
www.henkvandervalk.com
http://www.sqlsaturday.com/551/Sessions/Schedule.aspx
SQL PolyBase has been a high-end feature of SQL APS and is now also introduced in SQL 2016, SQL DB, and SQL DW! It allows you to use regular T-SQL statements for ad-hoc access to data stored in Hadoop and/or Azure Blob Storage from within SQL Server. This session will show you how it works and how to get started!
Starting SQL2016 on a server with 24 TB RAM
Just 4 fun!
Microsoft Worldwide Partner Conference 2016. © 2016 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Thanks to our platinum sponsors - PASS SQL Saturday Holland 2016
Thanks to our gold and silver sponsors - PASS SQL Saturday Holland 2016
APS Onsite!
Speaker Introduction
10+ years active in the SQLPass community!
10 years of Unisys-EMEA Performance Center
2002 - Largest SQL DWH in the world (SQL 2000)
Project Real (SQL 2005)
ETL World Record - loading 1 TB within 30 mins (SQL 2008)
Contributor to SQL performance whitepapers
Perf tips & tricks: www.henkvandervalk.com
Schuberg Philis - 100% uptime for mission-critical apps
Since April 1st, 2011: Microsoft Data Platform!
All info represents my own personal opinion (based upon my own experience) and not that of Microsoft.
@HenkvanderValk
Agenda
Intro - What is PolyBase & why?
Getting started - SQL Server product versions supported - Installation & setup
Creating external tables, running hybrid queries
Monitoring - Tips to improve Hadoop performance
Scale-out groups
SQL Server 2016 as fraud detection scoring engine
https://blogs.technet.microsoft.com/machinelearning/2016/09/22/predictions-at-the-speed-of-data/
HTAP (Hybrid Transactional Analytical Processing)
8 sockets, 192 cores, 16 TB RAM
The Big Data lake Challenge
How to orchestrate?
Different types of data: webpages, logs, and clicks; hardware and software sensors; semi-structured/unstructured data
Large scale: hundreds of servers
Advanced data analysis: integration between structured and unstructured data - the power of both
PolyBase builds the bridge: RDBMS + Hadoop + Azure Blob Storage - access any data
Just-in-time data integration across relational and non-relational data
Fast, simple data loading - best of both worlds
T-SQL compatible, uses computational power at the source
Opportunity for new types of analysis
PolyBase view in SQL Server 2016
Execute T-SQL queries against relational data in SQL Server and semi-structured data in HDFS and/or Azure
Leverage existing T-SQL skills and BI tools to gain insights from different data stores
Expand the reach of SQL Server to Hadoop (HDFS & WASB)
SQL Server + Hadoop + Azure Blob Storage: one query, one set of results
Remove the complexity of big data: T-SQL over Hadoop, JSON support
PolyBase T-SQL query
SQL Server
Hadoop
Quote: $658.39
Simple T-SQL to query Hadoop data (HDFS) - manage structured & unstructured data

Name         DOB       State
Denny Usher  11/13/58  WA
Gina Burch   04/29/76  WA
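As a sketch of what such a hybrid query can look like (the table and column names below are illustrative, not from the demo):

```sql
-- Hypothetical example: join a regular SQL Server table with an
-- external table whose data lives in HDFS or Azure Blob Storage.
-- [dbo].[Customers_Local] and [dbo].[Clicks_External] are made-up names.
SELECT c.Name, c.DOB, c.State, COUNT(*) AS ClickCount
FROM [dbo].[Customers_Local] AS c
JOIN [dbo].[Clicks_External] AS k   -- external table over HDFS data
    ON k.CustomerName = c.Name
GROUP BY c.Name, c.DOB, c.State;
```

To the query writer there is no visible difference between the local and the external table.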
PolyBase use cases
Load data: use Hadoop as an ETL tool to cleanse data before loading it into the data warehouse with PolyBase
Interactively query: analyze relational data together with semi-structured data using split-based query processing
Age-out data: age data out to HDFS and use it as cold but queryable storage
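The age-out scenario can be sketched in a few lines of T-SQL (the archive table name and cutoff date are illustrative; exporting requires 'allow polybase export' to be enabled, see the export steps):

```sql
-- Hypothetical age-out sketch: push cold rows into an external table.
-- [dbo].[lineitem_archive] is assumed to be an external table over
-- HDFS or Azure Blob Storage; names and dates are made up.
INSERT INTO [dbo].[lineitem_archive]
SELECT *
FROM [dbo].[lineitem]
WHERE L_SHIPDATE < '2010-01-01';   -- cold data, still queryable via T-SQL
```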
PolyBase - turning raw tweet data into information
Query & store Hadoop data - bidirectional, seamless & fast
Azure Blob Storage
Setup & query: SQL Server 2016 & SQL DW PolyBase!
#Demo
BCP out vs RTC 16
Prerequisites
An instance of SQL Server (64-bit), Enterprise or Developer Edition
Microsoft .NET Framework 4.5
Oracle Java SE Runtime Environment (JRE) version 7.51 or higher (64-bit); either JRE or Server JRE will work - go to Java SE downloads
Note: the installer will fail if the JRE is not present.
Minimum memory: 4 GB
Minimum hard disk space: 2 GB
TCP/IP connectivity must be enabled.
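After installation, a quick sanity check confirms that PolyBase is present on the instance:

```sql
-- Returns 1 when PolyBase is installed on this instance, 0 otherwise.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;
```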
Step 2: Install SQL Server
Install one or more SQL Server instances with PolyBase
PolyBase DLLs (Engine and DMS) are installed and registered as Windows services
Prerequisite: the user must download and install the JRE (Oracle)
Components introduced in SQL Server 2016
PolyBase Engine service
PolyBase Data Movement Service (DMS, with HDFS bridge)
External table constructs
MapReduce pushdown computation support
How to use PolyBase in SQL Server 2016
1. Set up a Hadoop cluster or an Azure Storage blob
2. Install SQL Server
3. Configure a PolyBase group (head node and compute nodes)
4. Choose your Hadoop flavor, then attach the Hadoop cluster or Azure Storage
PolyBase T-SQL queries are submitted to the head node; PolyBase queries can only refer to local tables and/or external tables on that instance
Step 1: Set up a Hadoop cluster
Hortonworks or Cloudera distributions
Hadoop 2.0 or above
Linux or Windows
On-premises or in Azure
Step 1 (alternative): Set up an Azure Storage blob
Azure Storage blob (ASB) exposes an HDFS layer
PolyBase reads and writes from ASB using the Hadoop RecordReader/RecordWriter
No compute pushdown support for ASB
Step 2: Configure a PolyBase group
A PolyBase scale-out group consists of one head node (PolyBase Engine + DMS) and multiple compute nodes (DMS only)
The head node is the SQL Server instance to which queries are submitted
Compute nodes are used for scale-out query processing of data in HDFS or Azure
Step 3: Choose your Hadoop flavor
Supported Hadoop distributions: Cloudera CDH 5.x on Linux; Hortonworks 2.x on Linux and Windows Server
What happens under the covers? The right client JARs are loaded to connect to the chosen Hadoop distribution
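The flavor is selected with the 'hadoop connectivity' server setting; each value maps to a supported distribution. A sketch (verify the value-to-distribution mapping for your exact build in the documentation):

```sql
-- Select the Hadoop flavor; e.g. on SQL Server 2016, option 7 maps to
-- Hortonworks 2.x, option 6 to Cloudera CDH 5.x (check the docs).
EXEC sp_configure 'hadoop connectivity', 7;
RECONFIGURE;
-- A restart of SQL Server (and the PolyBase services) is required
-- for the change to take effect.
```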
Step 4: Attach the Hadoop cluster or Azure Storage to the PolyBase group (head node and compute nodes)
After setup
Compute nodes are used for scale-out query processing on external tables in HDFS
Tables on compute nodes cannot be referenced by queries submitted to the head node
The number of compute nodes can be dynamically adjusted by the DBA
Hadoop clusters can be shared between multiple SQL Server 2016 PolyBase groups
- Improved PolyBase query performance with scale-out computation on external data (PolyBase scale-out groups)
- Improved PolyBase query performance with faster data movement from HDFS to SQL Server and between the PolyBase Engine and SQL Server
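Joining a compute node to a scale-out group is done with a stored procedure on that node; a sketch (machine and instance names here are examples):

```sql
-- Run on each compute node to join it to the head node's group.
-- Arguments: head node machine name, DMS control channel port
-- (16450 is the default), head node SQL Server instance name.
EXEC sp_polybase_join_group N'HeadNodeMachine', 16450, N'MSSQLSERVER';
-- Restart the PolyBase Engine and DMS services afterwards.
-- To leave the group again: EXEC sp_polybase_leave_group;
```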
PolyBase configuration

--1: Create a master key on the database.
--   Required to encrypt the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'SQLSat#551';
-- SELECT * FROM sys.symmetric_keys

-- Create a database scoped credential for Azure Blob Storage.
-- IDENTITY: any string (this is not used for authentication to Azure storage).
-- SECRET: your Azure storage account key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'wasbuser',
SECRET = '1abcdEFGb3Mcn0F9UdJS/10taXmr5L17xrEO17rlMRL8SNYg==';
Create external data source

--2: Create an external data source.
-- LOCATION: Azure storage account name and blob container name.
-- CREDENTIAL: the database scoped credential created above.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://[email protected]',
    CREDENTIAL = AzureStorageCredential
);

-- View the list of external data sources:
SELECT * FROM sys.external_data_sources;
Create external file format

--3: Create an external file format.
-- FORMAT_TYPE: type of format in Hadoop
-- (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
-- SELECT * FROM sys.external_file_formats

-- With GZIP compression:
CREATE EXTERNAL FILE FORMAT TextDelimited_GZIP
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);
Create external table

--4: Create an external table.
-- The external table points to data stored in Azure storage.
-- LOCATION: path to a file or directory that contains the data
--           (relative to the blob container).
-- To point to all files under the blob container, use LOCATION = '/'
CREATE EXTERNAL TABLE [dbo].[lineitem4] (
    [ROWID1] [bigint] NULL,
    [L_SHIPDATE] [smalldatetime] NOT NULL,
    [L_ORDERKEY] [bigint] NOT NULL,
    [L_DISCOUNT] [smallmoney] NOT NULL,
    ..
    [L_COMMENT] [varchar](44) NOT NULL
)
WITH (
    LOCATION = '/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
Import

-- IMPORT data from WASB into a NEW table:
SELECT *
INTO [dbo].[LINEITEM_MO_final_temp]
FROM (SELECT * FROM [dbo].[lineitem1]) AS Import;
Export data (gzipped)

-- Enable export / INSERT into external table:
sp_configure 'allow polybase export', 1;
RECONFIGURE;

CREATE EXTERNAL TABLE [dbo].[lineitem_export] (
    [ROWID1] [bigint] NULL,
    ..
    [L_SHIPINSTRUCT] [varchar](25) NOT NULL,
    [L_COMMENT] [varchar](44) NOT NULL
)
WITH (
    LOCATION = '/gzipped',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextDelimited_GZIP,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
Manage external resources in SSMS / VSTS
New: External Tables; External Resources (external data sources, external file formats)
PolyBase query example #1

-- SELECT on an external table (data in HDFS):
SELECT * FROM Customer
WHERE c_nationkey = 3 AND c_acctbal < 0;

A possible execution plan:
1. CREATE temp table T - executed on the compute nodes
2. IMPORT FROM HDFS - the HDFS Customer file is read into T
3. EXECUTE QUERY - SELECT * FROM T WHERE T.c_nationkey = 3 AND T.c_acctbal < 0
Additionally, there is:
- Support for exporting data to an external data source via INSERT INTO EXTERNAL TABLE SELECT FROM TABLE
- Support for push-down computation to Hadoop for string operations (compare, LIKE)
- Support for the ALTER EXTERNAL DATA SOURCE statement
PolyBase query example #2

-- SELECT and aggregate on an external table (data in HDFS):
SELECT AVG(c_acctbal) FROM Customer
WHERE c_acctbal < 0
GROUP BY c_nationkey;

Execution plan:
1. Run MapReduce job on Hadoop - apply filter and compute aggregate on Customer
What happens here?
Step 1: The query optimizer compiles the predicate into Java and generates a MapReduce (MR) job.
Step 2: The engine submits the MR job to the Hadoop cluster; the output is left in hdfsTemp.
PolyBase query example #2 (continued)
The predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job; the query optimizer makes a cost-based decision on which operators to push.
1. Run MR job on Hadoop - apply filter and compute aggregate on Customer; output left in hdfsTemp
2. CREATE temp table T - on the DW compute nodes
3. IMPORT hdfsTemp - read hdfsTemp into T
4. RETURN OPERATION - SELECT * FROM T
Query relational and non-relational data, on-premises and in Azure
Apps issue a single T-SQL query that spans SQL Server and Hadoop
Capability: T-SQL for querying relational and non-relational data across SQL Server and Hadoop
Benefits: new business insights across your data lake; leverage existing skill sets and BI tools; faster time to insights and a simplified ETL process
Summary: PolyBase
Query relational and non-relational data with T-SQL - access any data
When it comes to key BI investments, we are making it much easier to manage relational and non-relational data. PolyBase technology allows you to query Hadoop data and SQL Server relational data through a single T-SQL query. One of the challenges we see with Hadoop is that there are not enough people knowledgeable in Hadoop and MapReduce, and this technology simplifies the skill set needed to manage Hadoop data. This also works across your on-premises environment or SQL Server running in Azure.
Monitoring PolyBase queries
Lots of new DMVs

-- Monitoring PolyBase - all DMVs:
SELECT * FROM sys.external_tables;
SELECT * FROM sys.external_data_sources;
SELECT * FROM sys.external_file_formats;

SELECT * FROM sys.dm_exec_compute_node_errors;
SELECT * FROM sys.dm_exec_compute_node_status;
SELECT * FROM sys.dm_exec_compute_nodes;
SELECT * FROM sys.dm_exec_distributed_request_steps;
SELECT * FROM sys.dm_exec_dms_services;

SELECT * FROM sys.dm_exec_distributed_requests;
SELECT * FROM sys.dm_exec_distributed_sql_requests;
SELECT * FROM sys.dm_exec_dms_workers;
SELECT * FROM sys.dm_exec_external_operations;
SELECT * FROM sys.dm_exec_external_work;
Find the longest running query

SELECT execution_id, st.text, dr.total_elapsed_time
FROM sys.dm_exec_distributed_requests dr
CROSS APPLY sys.dm_exec_sql_text(sql_handle) st
ORDER BY total_elapsed_time DESC;
Find the longest running step of the distributed query plan

SELECT execution_id, step_index, operation_type, distribution_type,
       location_type, status, total_elapsed_time, command
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QID1120'
ORDER BY total_elapsed_time DESC;
Details on a step_index

SELECT execution_id, step_index, dms_step_index, compute_node_id,
       type, input_name, length, total_elapsed_time, status
FROM sys.dm_exec_external_work
WHERE execution_id = 'QID1120' AND step_index = 7
ORDER BY total_elapsed_time DESC;
Optimizations
PolyBase - data compression to minimize data movement
http://henkvandervalk.com/aps-polybase-for-hadoop-and-windows-azure-blob-storage-wasb-integration
Enable pushdown configuration (Hadoop) - improves query performance
1. Find the file yarn-site.xml in the installation path of SQL Server:
C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016RTM\MSSQL\Binn\Polybase\Hadoop\conf\yarn-site.xml
2. On the Hadoop machine, in the Hadoop configuration directory, copy the value of the configuration key yarn.application.classpath.
3. On the SQL Server machine, in the yarn-site.xml file, find the yarn.application.classpath property and paste the value from the Hadoop machine into the value element.
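Once pushdown is configured, it can also be influenced per query with T-SQL hints:

```sql
-- Force the predicate/aggregate to run as a MapReduce job on Hadoop:
SELECT AVG(c_acctbal)
FROM Customer            -- external table over Hadoop
WHERE c_acctbal < 0
GROUP BY c_nationkey
OPTION (FORCE EXTERNALPUSHDOWN);

-- ...or keep all processing in SQL Server:
-- OPTION (DISABLE EXTERNALPUSHDOWN)
```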
Time to insights
APS cybercrime video & demo!
Various sources, single query
Further Reading
Get started with PolyBase: https://msdn.microsoft.com/en-us/library/mt163689.aspx
Data compression tests:http://henkvandervalk.com/aps-polybase-for-hadoop-and-windows-azure-blob-storage-wasb-integration
www.henkvandervalk.com
Please fill in the evaluation forms