
© 2019 IBM Corporation

Db2 Common SQL Engine Enhancements Update: What's New in Db2 Hybrid Cloud and IIAS

Vincent Kulandaisamy ([email protected])

https://www.linkedin.com/in/vincentkulandaisamy

Mar 06 2019

Safe Harbor Statement and Disclaimer

Copyright IBM Corporation 2019. All rights reserved. U.S. Government Users Restricted Rights - use, duplication, or disclosure restricted by GSA ADP Schedule Contract with IBM Corporation.

• IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml.

• The information contained in this presentation is provided for informational purpose only. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided "as is" without warranty of any kind, expressed or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other documentation.

• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM (or its suppliers or licensors), or altering the terms and conditions of any agreement or license governing the use of IBM products and/or software.

• Any statements of performance are based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multi-programming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated.

• IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.


Our family of Hybrid Data Management solutions, built on the Db2 common SQL engine

Write your SQL once, deploy against any form factor, run anywhere

• Db2 on Cloud: fully managed, cloud transactional data store
• Db2 Warehouse on Cloud: fully managed, cloud data warehouse
• Db2 Hosted: not managed; we install Db2 and hand the keys over to you
• Hosted Analytics with Hortonworks: hosted Hadoop deployment with Big SQL and Data Science Experience
• Integrated Analytics System: dedicated analytics appliance
• Db2 & Db2 Warehouse: transactional or analytics SQL database deployed on commodity hardware
• Db2 Big SQL: open source Hadoop with Hortonworks

The Hybrid DW Strategy: Write once, run anywhere

Offers clients choice in selecting the best (combination of) data stores to satisfy hybrid data warehouse solution needs

Built on a common and fluid analytics SQL engine enabling true hybrid data solutions with portable analytics

Platforms: Db2 Warehouse on Cloud, Integrated Analytics System, Db2 Big SQL, Db2 Warehouse

• Common skill set: one skill set for all deployments; drives higher efficiencies and portfolio rationalization
• Operational compatibility: reuse operational and housekeeping procedures
• Next-gen analytics: common programming model for in-DB analytics
• Data virtualization: common Fluid Query capabilities for query federation and data movement
• Application agility: write once, run anywhere; one ISV product certification for all platforms
• Licensing: flexible entitlements for business agility and cost optimization

Db2: Flexibility of Deployment

[Figure: deployment options. Axis labels: On Premise / On Cloud and Control / Simplicity. Form factor labels: Software, Docker, Appliance, Private Cloud, PaaS, DBaaS (Hosted), DBaaS (Managed). Product labels: Db2 Warehouse, IBM Integrated Analytics System, Db2 Hosted, Db2 on Cloud, Db2 Warehouse on Cloud.]

BLU Acceleration: New Era of Smart

IBM Research & Development Lab Innovations

• Analyze born-on-the-cloud data sets: Your mobile apps, IoT devices, and web apps are all generating volumes of data. Stick that data in a cloud data warehouse.

• Offload your analytics: Transactional databases are optimized for transactional workloads. Offload your big data projects to a database optimized for the job.

• Modernize your data warehouse: Transform your data warehouse with highly performant, in-memory, columnar, in-database analytics, and simplify technology processes with data virtualization (federation), scalability, and workload portability.

• Extend your on-premises data store to the cloud: Cloud brings the promise of elasticity, performance, cost savings, and business agility. Take advantage of it.

Our Db2 Warehouse on Cloud footprint keeps growing…

Here’s what some of our public reference customers are doing today:

• Helping machinery operators around the world predict and prevent equipment failure
• Analyzing data from Wi-Fi access points to enhance tourist recommendations
• Performing advanced analytics on sales data to drive customer engagement
• Training machine learning and deep learning models to personalize shopping experiences
• Analyzing global viewership data to optimize content delivery
• Centralizing operational data to accelerate reporting and business planning

Read more: ibm.com/cloud/db2-warehouse-on-cloud

KONE expects to ease its own maintenance efforts by connecting the KONE global maintenance base of more than 1.1 million elevators and escalators to the IBM Watson technology-driven, cloud-based system that helps predict maintenance needs before equipment can fail. Technicians have more information at their fingertips when they arrive at a service call, which speeds problem resolution and gets equipment back online quicker.

“From a pure technology standpoint, the heart and soul of everything we do at RSG Media is IBM Db2 Warehouse on Cloud. We have benchmarked it against leading competitors such as Snowflake and Amazon Redshift, and Db2 Warehouse on Cloud gives us the best price-performance ratio.”—Shiv Sehgal, CEO, Media Mantra at RSG Media

“Moving to cloud has reduced our on-premises hardware footprint, and we have been able to cut IT operational costs by 35 percent. At the same time, we have escaped the capacity limitations of the on-premises model, and can scale our cloud environment up to run more reports in parallel when we need to.”—Vimal Dev, Vice President – IT, Global Enterprise Applications Leader, Genpact

“The response from IBM was amazing—they were so positive about the idea and immediately looked into how they could best support this initiative. We talked to other AI companies as well, but IBM’s response was the most impressive by far.”—Tanja Hoel, Director, Seafood Innovation Cluster

Here’s what our customers are saying about us…

Think 2019 / DOC ID / February, 2019 / © 2019 IBM Corporation

Performance Improvements in Db2 Columnar Engine (a.k.a. Db2 BLU)

Goals

1. Reduce overall memory footprint and demand on system resources, and improve utilization
2. Improve workload concurrency
3. Improve individual query and overall workload performance

Some of the noteworthy enhancements in the following categories were delivered to Db2 Warehouse, Db2 Warehouse on Cloud, IDAA and IIAS:

1. Rewrite Optimizations
2. Optimizer Improvements
3. Runtime Improvements
4. Compression and Data Movement
5. SQL Compatibility
6. WLM

These and many other features are planned for future Db2 on-premises releases.

Rewrite Optimizations

1. Reduce UNION ALL over disjoint branches

2. Transform UNION over DUAL tables to VALUES

3. Perform VALUES Cartesian joins at query rewrite

4. Push down/fold COALESCE/NVL into the VALUES

5. Enforce early DISTINCT (DISTINCT to GROUP BY)

6. Eliminate redundant TO_DATE casting functions


1. UNION ALL Reduction over Disjoint Branches

• Transforms UNION ALL over disjoint branches into a simple SELECT with CASE expressions

• Significant compile time reduction
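The slide does not show the rewritten query, so the following is only a sketch of what such a transformation could look like. It uses Python's built-in sqlite3 module (not Db2) and a hypothetical `sales` table to check that a UNION ALL over two disjoint branches returns the same rows as a single SELECT with a CASE expression.

```python
import sqlite3

# Hypothetical table and data, used only to compare the two query shapes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EAST", 10), ("WEST", 20), ("EAST", 30), ("NORTH", 40)],
)

# Original shape: two branches whose predicates can never both be true.
union_all_sql = """
    SELECT 'east-coast' AS label, amount FROM sales WHERE region = 'EAST'
    UNION ALL
    SELECT 'west-coast' AS label, amount FROM sales WHERE region = 'WEST'
"""

# Rewritten shape: one scan, with CASE choosing the per-branch expression.
case_sql = """
    SELECT CASE WHEN region = 'EAST' THEN 'east-coast'
                WHEN region = 'WEST' THEN 'west-coast' END AS label,
           amount
    FROM sales
    WHERE region IN ('EAST', 'WEST')
"""

union_rows = sorted(conn.execute(union_all_sql).fetchall())
case_rows = sorted(conn.execute(case_sql).fetchall())
print(union_rows == case_rows)
```

The rewritten form has one query block instead of two, which is one plausible reason compile time drops as the number of branches grows.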


2. UNION over DUAL tables to VALUES

• Transforms UNION over single-row DUAL tables into a multi-row Values operation

• Significant compile time reduction
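As a rough illustration of this transformation, the sketch below compares a UNION of single-row selects with an equivalent multi-row VALUES query. It runs on SQLite through Python's sqlite3 module, not on Db2; SQLite has no DUAL table, so a one-row stand-in named `dual` is created, and `column1`/`column2` are SQLite's default names for VALUES columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for DUAL: a table holding exactly one row (an assumption for SQLite).
conn.execute("CREATE TABLE dual (dummy TEXT)")
conn.execute("INSERT INTO dual VALUES ('X')")

# Original shape: one single-row branch per literal row, combined with UNION.
union_sql = """
    SELECT 1 AS id, 'red' AS colour FROM dual
    UNION
    SELECT 2, 'green' FROM dual
    UNION
    SELECT 1, 'red' FROM dual
"""

# Rewritten shape: a single multi-row VALUES operation.
# DISTINCT keeps the duplicate-removing semantics of UNION.
values_sql = """
    SELECT DISTINCT column1 AS id, column2 AS colour
    FROM (VALUES (1, 'red'), (2, 'green'), (1, 'red'))
"""

dual_rows = sorted(conn.execute(union_sql).fetchall())
values_rows = sorted(conn.execute(values_sql).fetchall())
print(dual_rows == values_rows)
```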

3. VALUES Cartesian Join in Query Rewrite

• Pre-computes Cartesian joins between VALUES boxes in query rewrite
• Avoids multiple NLJNs with the fact table

4. Push COALESCE/NVL into the VALUES box

• Pushes down and folds the COALESCE/NVL scalar function into the VALUES box
• Without the COALESCE pushed down, the plan could use an NLJN instead of an HSJN if the VALUES box and the COALESCE box were split in the plan
• Improves selectivity estimates when COALESCE is referenced in a predicate

5. Enforce Early Distinct/Distinct to Group By

• Keeps the DISTINCT in place rather than letting the rewrite engine pull it up in the graph to be merged into other operations; to favor CSEs, replaces DISTINCT with a GROUP BY operator
• Significant benefits from removing duplicates early
• Avoids manual rewrites in queries and views
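The DISTINCT-to-GROUP BY replacement relies on the two forms returning the same rows. A minimal check of that equivalence, using SQLite through Python's sqlite3 module and a hypothetical `orders` table (none of this is Db2-specific), might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", "tea"), ("ann", "tea"), ("bob", "tea"), ("ann", "jam"), ("bob", "tea")],
)

# Duplicate removal written with DISTINCT ...
distinct_rows = sorted(
    conn.execute("SELECT DISTINCT customer, product FROM orders").fetchall()
)
# ... and the same duplicate removal written as a GROUP BY on all output columns.
group_by_rows = sorted(
    conn.execute(
        "SELECT customer, product FROM orders GROUP BY customer, product"
    ).fetchall()
)
print(distinct_rows == group_by_rows)
```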

6. Remove Redundant TO_DATE Casting Function

• TO_DATE is redundant when applied to a DATE type column
• Depending on the context (e.g. predicates, select list), the function can be replaced by the source column or TIMESTAMP()
• Provides better selectivity estimates when removed from predicates
• Reduces execution time, as TO_DATE is expensive

Optimizer Improvements


1. Smart statistics collection on MPP

2. Local sort for insert-subselect with order by

1. Smarter Statistics Collection on MPP

Old behavior: table statistics are extrapolated from the first partition in the table’s partition group

New behavior: table statistics are extrapolated from the first non-empty table partition

– Improves robustness and performance through better query optimizer planning

[Figure: MyTable across three partitions. Partition 1: 0 rows; Partition 2: 0 rows; Partition 3: 12 rows.]
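The difference between the two behaviors can be sketched with the three-partition example from the figure. The slide does not say how the extrapolation is computed; multiplying one partition's row count by the number of partitions is an assumption made here for illustration, and both function names are hypothetical.

```python
from typing import List


def extrapolate_old(rows_per_partition: List[int]) -> int:
    """Old behavior (assumed formula): scale up the first partition's row count."""
    return rows_per_partition[0] * len(rows_per_partition)


def extrapolate_new(rows_per_partition: List[int]) -> int:
    """New behavior (assumed formula): scale up the first non-empty partition."""
    first_non_empty = next((n for n in rows_per_partition if n > 0), 0)
    return first_non_empty * len(rows_per_partition)


partitions = [0, 0, 12]  # row counts from the figure
old_estimate = extrapolate_old(partitions)
new_estimate = extrapolate_new(partitions)
print(old_estimate, new_estimate)  # prints "0 36"
```

With the old rule the optimizer would see an empty table; with the new rule it sees a non-zero cardinality, which is the robustness gain the slide describes.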

2. Local Sort for Insert-Subselect with Order-By

High level

- Identify intra-partition data movement
- Avoid needless processing by the coordinator
- Avoid FCM and TQ processing

Applicability

- CTAS or insert-from-select scenarios with ORDER BY
- Both source and target data must be on the same partition

Value proposition

- Significantly improves table sorting and ETL operations

2. Local Sort for Insert-Subselect with Order-By (Cont’d)

[Figure: without local sort, rows flow from the source data nodes through the coordinator node to the target data nodes; with local sort, rows flow directly from the source data nodes to the target data nodes.]

• During an Insert-Subselect with Order-by, the sorting for the order-by can be done locally without sending rows to the coordinator node.

• E.g. insert into t1 select * from t2 order by c1;
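The following toy model contrasts the two data flows. It is a sketch, not Db2 internals: partitions are plain Python lists, every row is assumed to land on the same partition it came from (the applicability condition stated earlier), and the function names and the "rows shipped" counter are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]                 # (c1, payload)
Partitions = Dict[int, List[Row]]     # partition id -> rows stored there


def insert_via_coordinator(source: Partitions) -> Tuple[Partitions, int]:
    """Ship every row to one coordinator, sort there, then ship rows back."""
    shipped = 0
    tagged = []
    for part_id, rows in source.items():
        for row in rows:
            tagged.append((row, part_id))
            shipped += 1                      # source node -> coordinator
    tagged.sort(key=lambda item: item[0][0])  # global sort on c1
    target: Partitions = {part_id: [] for part_id in source}
    for row, part_id in tagged:
        target[part_id].append(row)
        shipped += 1                          # coordinator -> target node
    return target, shipped


def insert_with_local_sort(source: Partitions) -> Tuple[Partitions, int]:
    """Sort each partition where it lives; no row leaves its partition."""
    target = {
        part_id: sorted(rows, key=lambda r: r[0])
        for part_id, rows in source.items()
    }
    return target, 0


t2 = {0: [(5, "e"), (1, "a")], 1: [(4, "d"), (2, "b"), (3, "c")]}
via_coordinator, moved_coordinator = insert_via_coordinator(t2)
local, moved_local = insert_with_local_sort(t2)
print(via_coordinator == local, moved_coordinator, moved_local)
```

In this model both paths leave each target partition with the same rows in the same order, while the local path moves no rows between nodes.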

Runtime Improvements


1. Efficient sorting of skewed data

2. Improved memory utilization on GROUP BY with skewed data

3. Improved memory utilization on aggregation distinct with wide columns

4. Hash join Residual Predicates support

5. Efficient system resources utilization on complex queries

1. Efficient Sorting of Skewed Data

• Depending on data characteristics, heavily skewed data might create data chunks in which the majority of values are duplicates and a minority are distinct values

New enhancement: efficient sorting with significantly less memory

• Identifies the majority duplicate value and skips sorting the duplicates
• Sorts only the minority distinct values
• Emits the values in the correct sorted order
• Improved SQL ORDER BY performance

[Figure: a data chunk split into a duplicate value chunk and a minority distinct values chunk.]
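The three steps above can be sketched in a few lines. This is an illustration of the idea on a Python list, not the columnar engine's implementation; `sort_skewed` is a hypothetical name.

```python
import bisect
from collections import Counter
from typing import List


def sort_skewed(chunk: List[int]) -> List[int]:
    """Sort a chunk by setting aside its most frequent value and sorting the rest."""
    if not chunk:
        return []
    # Step 1: identify the majority duplicate value and how often it occurs.
    majority_value, majority_count = Counter(chunk).most_common(1)[0]
    # Step 2: only the minority values go through the comparison sort.
    minority = sorted(v for v in chunk if v != majority_value)
    # Step 3: emit in order by splicing the run of duplicates into place.
    split = bisect.bisect_left(minority, majority_value)
    return minority[:split] + [majority_value] * majority_count + minority[split:]


chunk = [7, 7, 3, 7, 7, 9, 7, 1, 7]
print(sort_skewed(chunk) == sorted(chunk))
```

Only three of the nine values are actually sorted here; the six duplicates are counted and written out as a run.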

2. Improved Memory Utilization on GROUP BY


• Depending on data characteristics, heavily skewed data might create highly imbalanced distribution across the chunks when partitioning the data during GROUP BY processing

– leads to higher memory footprint during grouping and aggregation

New enhancement

• Improved parallelism on grouping/aggregation and spilling data chunks when available memory is constrained

• Significantly reduces memory footprint and improves performance

3. Improved Memory Utilization on Aggregation Distinct

• Aggregation distinct on wider columns significantly increases memory footprint and affects query performance

• Example:

create table T (store_id BIGINT, prodname VARCHAR(2000), storename VARCHAR(5000)…) organize by column;

select store_id, count(distinct prodname), count(distinct storename) from T group by store_id;

• The actual width of the tuples may be significantly shorter than the schema column width

New enhancement

• Varying length values are stored in compact form during aggregation distinct processing

• significant memory reduction!

• improved aggregation distinct performance when actual width is smaller than schema column width

• New enhancement is available under a registry knob

4. Hash Join Residual Predicates Support

• What are join residual predicates?

• Predicates applied against non-join-key column(s)

• For outer and anti-joins, predicates must reference row preserving side

• For inner joins, predicates must reference both sides of the join tables

• Can be simple equality, range predicates or expressions

4. Hash Join Residual Predicates Support (Cont’d)

Join types: INNER JOIN (IJ), LEFT OUTER JOIN (LOJ), RIGHT OUTER JOIN (ROJ), LEFT ANTI JOIN (LAJ), RIGHT ANTI JOIN (RAJ)

Necessary conditions:
- IJ: Must reference both the fact and dimension sides.
- OJ/AJ: Must reference at least the row-preserving side.
- LOJ/LAJ: Fact side is row preserving
- ROJ/RAJ: Dimension side is row preserving

Equality join predicates, residual predicates, and local predicates together return the results for a given join:

select * from t1 left outer join t2
    on t1.c1 = t2.c1                  -- JOIN_PRED
    and t1.c2 <= 5                    -- RESID_PRED
    and t1.c3 LIKE '%mario%'          -- RESID_PRED
    and t2.c4 IN (10, 100, 1000)      -- LOCAL PRED
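To make the roles of the three predicate kinds concrete, here is a minimal hash left outer join for the query above, written as a sketch in Python with made-up rows. It is not how the Db2 runtime is implemented; it only shows the local predicate applied while building the hash table, the join predicate as a hash lookup, and the residual predicates evaluated during the probe. The result is compared with SQLite running the same query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: t1(c1, c2, c3) and t2(c1, c4).
t1 = [(1, 3, "super mario"), (1, 9, "mario kart"), (2, 4, "luigi"), (3, 1, "mario")]
t2 = [(1, 10), (1, 55), (2, 100), (3, 1000)]


def hash_left_outer_join(left, right):
    # Build phase: hash t2 on the join key, applying the local predicate up front.
    table = defaultdict(list)
    for c1, c4 in right:
        if c4 in (10, 100, 1000):               # LOCAL PRED on t2
            table[c1].append((c1, c4))
    out = []
    for c1, c2, c3 in left:
        matches = []
        # Residual predicates reference the row-preserving side (t1) and are
        # checked while probing, not as a filter after the join.
        if c2 <= 5 and "mario" in c3:           # RESID_PREDs
            matches = table.get(c1, [])         # JOIN_PRED via hash lookup
        if matches:
            out.extend((c1, c2, c3) + m for m in matches)
        else:
            out.append((c1, c2, c3, None, None))  # keep the t1 row, NULL-extended
    return out


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (c1 INTEGER, c2 INTEGER, c3 TEXT)")
conn.execute("CREATE TABLE t2 (c1 INTEGER, c4 INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", t1)
conn.executemany("INSERT INTO t2 VALUES (?, ?)", t2)
reference = conn.execute(
    """
    SELECT * FROM t1 LEFT OUTER JOIN t2
        ON t1.c1 = t2.c1
        AND t1.c2 <= 5
        AND t1.c3 LIKE '%mario%'
        AND t2.c4 IN (10, 100, 1000)
    """
).fetchall()

sketch = hash_left_outer_join(t1, t2)
print(sorted(sketch, key=repr) == sorted(reference, key=repr))
```

Rows of t1 that fail a residual predicate are still returned, padded with NULLs, which is why the residual predicates cannot simply be applied as a filter after the join.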

Supported predicates:

• LIKE, IN, ISNULL, ISNOTNULL, EQ, NEQ, GEQ, LEQ, LT, GT, BETWEEN, OVERLAPS

Residual predicates with expressions:

• Implicit casts: e.g., t1.c1 < t2.c1 // t1.c1 and t2.c1 are different comparable data types

• Explicit casts: e.g., CAST (t1.c1 as BIGINT) < t2.c1

• Scalar functions: e.g., MOD(t1.c3, 2) < t2.c3

• Arithmetic expressions: e.g., t1.c1 + 5 > t2.c1

• Any combination of above

Not supported:

• Predicates where a single operand references both sides of JOIN. e.g., t1.c1 + t2.c1 <resid_pred> op2


4. Hash Join Residual Predicates Performance


NOTE: Performance results based on internal targeted workloads. Individual mileage may vary


5. Efficient Resource Utilization on Complex Queries


• Efficient reuse/repurpose of the Db2 system agents and resources for executing different query blocks / subsections of a complex query plan

• Significantly reduces number of active agents and system resource utilization

Example 1

select 't1' as tablename, count(*) from t1 where c1 = 0
union all
select 't2' as tablename, count(*) from t2 where c1 = 0
union all
select 't3' as tablename, count(*) from t3 where c1 = 0
union all
select 't4' as tablename, count(*) from t4 where c1 = 0

5. Efficient Resource Utilization on Complex Queries (Cont’d)

Example 2

select col1, count(distinct col3), count(distinct col4), count(distinct col6), count(distinct col7), count(distinct col8), count(distinct col9), count(distinct col10), count(distinct col11), count(distinct col12) from ua100k group by col1

Compression & Data Movement

Compression Enhancements

1. Enhancements to SQL-based insert and update statements

• Significant optimizations to process large volume of data more efficiently and faster

• Optimized and improved compression

2. REORG TABLE enhancement to further improve compression

• Recompress the initial data inserted before the creation of the dictionary

• After dictionary is built, initial data is recompressed with new dictionary

• Completely automated so compression should improve with no user intervention

[Figure: un-encoded and encoded data before recompression; encoded data after.]

Bulk Insert & Update Performance

• SMP parallelism: INSERT processing is parallelized to use multiple cores/threads
• Vectorization of the BLU bulk insert runtime
• Log reservation
• Encoding improvement
• Code path optimization

• Bulk insert is around 4X faster compared to Db2 v11.1!

Reduced Logging

Reduced logging improvements:
– Consists of two parts: reduced undo logging and reduced redo logging.
– Both parts share the following:
• Applies only to column-organized tables.
• Applies to any bulk load operation which drives insert internally (e.g. insert from subselect, update, merge).

Reduced undo logging improvements:
– Avoids undo processing and logging for data page contents.
– No new functional restrictions/limitations.
– Results in significantly less active log space needed for bulk insert (~40% less in our testing).

Reduced redo logging improvements:
– Logs metadata changes, but skips logging of page contents.
– Similar to NLI (NOT LOGGED INITIALLY) tables, but with improved recoverability and concurrency.

Table contents will be preserved during:
– Rollback
– Crash recovery
– Recovery to the time of the last backup

Table contents are preserved to end of backup for any online backup.

Total impact: 95% reduction in required log space

External Tables

A simple mechanism to treat external “files” as a database table using SQL statements

• Can also be used to load from or unload to external files

• Can be used to define a permanent external table or directly within a SQL statement

• Currently supports GZIP and uncompressed file formats

• Sources can be local/remote sources including object stores Swift and AWS S3

Example:

create external table ext_orders(order_num INT, order_dt TIMESTAMP)
USING (dataobject('/tmp/order.tbl') DELIMITER '|');

insert into orders (select * from ext_orders where order_num > 10);

Example:

insert into orders (select * from external '/tmp/orders.txt' using (REMOTESOURCE GZIP delimiter ','));

Example: unloading from a base table to an external table

insert into ext_orders select * from source_table;
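For readers without a Db2 system at hand, the idea of querying a delimited file as if it were a table can be approximated in plain Python. The sketch below is an analogy only: it does not use Db2 or the external table syntax. The temporary file, the `external_rows` helper, and the SQLite target table are all hypothetical stand-ins. Rows are streamed from a gzip-compressed file, filtered the way the `order_num > 10` predicate would filter them, and inserted into a table.

```python
import csv
import gzip
import os
import sqlite3
import tempfile

# Create a small gzip-compressed, comma-delimited file to play the external file.
path = os.path.join(tempfile.mkdtemp(), "orders.txt.gz")
with gzip.open(path, "wt", newline="") as f:
    csv.writer(f).writerows(
        [
            [5, "2019-01-02 10:00:00"],
            [11, "2019-01-03 11:30:00"],
            [42, "2019-01-04 09:15:00"],
        ]
    )


def external_rows(file_path, delimiter=","):
    """Yield typed rows from a gzip-compressed delimited file, one at a time."""
    with gzip.open(file_path, "rt", newline="") as f:
        for order_num, order_dt in csv.reader(f, delimiter=delimiter):
            yield int(order_num), order_dt


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_num INTEGER, order_dt TEXT)")
# Filter while streaming, so rejected rows never need staging space.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    (row for row in external_rows(path) if row[0] > 10),
)
loaded = conn.execute("SELECT order_num FROM orders ORDER BY order_num").fetchall()
print(loaded)
```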

External Table Advantages

• Inserts into the target table are logged, unlike Db2 Load

• Constraint validation, unlike Db2 Load

• Complex expressions on data being loaded (ETL capabilities), as compared to Db2 Ingest or Db2 Load

• Does not require a Z-lock on target table

• Allows a source data file to be joined with another table before loading into target table

• Load selective rows by applying filters

• Load remote data files without any staging space – Remote streaming

• Ability to load compressed files directly and from heterogeneous data sources

• Easier to integrate with external applications

• Fast data load due to parallel formatters and parallel insert

• Easy diagnostics using generated Log & Bad file

Dramatic Speedup on Data Ingestion

Combined techniques lead to dramatic performance gains:

1. SQL INSERT-based bulk loading
2. Reduced logging
3. New faster data parser, with External Tables syntax
4. Vectorized inserts
5. In-memory synopsis table inserts
6. Parallel insert (SMP and MPP combined)

Achieve dramatic improvements in data ingestion:

• 4x faster than the previous generation of the Db2 LOAD utility
• Load data completely online with External Tables while running SQL queries, deletes, or updates
• Run multiple bulk loads against the same table simultaneously

[Figure: Bulk Ingest Rates (higher is better); y-axis: Load Rate (TB/hr); a 4X speedup is annotated.]

SQL Language Compatibility


Language Support

DB2 SQL & SQL PL

Oracle SQL & PL/SQL

Netezza SQL, PL/SQL, UDX

PostgreSQL SQL

R

Spark*, Scala, PySpark

JSON JAVA API

Python

JDBC/ODBC

Significant Language Extensions

• New data types: BINARY, VARBINARY and BOOLEAN

• Scalar Functions: HASH, HASH4, HASH8 and many other new scalar functions

• Regular expression scalar functions (highly sought after functions) - REGEXP_*

• Native LOB support

And many others…
– LIMIT and OFFSET clauses
– ORDER BY clause: NULLs can be sorted FIRST or LAST in either ascending (ASC) or descending (DESC) order
– OLAP specification extensions: NTH_VALUE, CUME_DIST, PERCENT_RANK
– DISTINCT support on the LISTAGG aggregate function
– Oracle outer join syntax (+) no longer requires a compatibility switch (often used as a stand-alone feature)
– OVERLAPS predicate: determines whether two chronological periods overlap. A chronological period is specified by a pair of date-time expressions (the first expression specifies the start of a period; the second specifies its end).
– CREATE FUNCTION AGGREGATE: extend the database with user-defined aggregation
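The OVERLAPS description can be illustrated with a small period-overlap check. The slides do not state how boundaries are treated, so this sketch assumes half-open periods (start included, end excluded) with start not later than end; Db2's exact rules may differ, and `periods_overlap` is a hypothetical helper name.

```python
from datetime import date


def periods_overlap(start1: date, end1: date, start2: date, end2: date) -> bool:
    """True when two half-open periods [start, end) share at least one instant.

    Assumes start <= end for both periods; boundary handling is an assumption.
    """
    return start1 < end2 and start2 < end1


# January-February overlaps February-March when the first period runs into March.
print(periods_overlap(date(2019, 1, 1), date(2019, 3, 1),
                      date(2019, 2, 1), date(2019, 4, 1)))
# Periods that merely touch at a boundary do not overlap under this assumption.
print(periods_overlap(date(2019, 1, 1), date(2019, 2, 1),
                      date(2019, 2, 1), date(2019, 3, 1)))
```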

Adaptive Workload Management

• Intelligent job scheduling
– Cost evaluation includes memory, CPU load, and time duration of queries
– Includes historical feedback based on past executions
– Scheduling based on a dynamic view of resource availability in each “lane”
– Accuracy of predicted resource consumption determines queuing and performance

• Benefits
– Improved robustness under high load
– Improved SLA achievement
– Improved overall resource efficiency and throughput

• Enhancements
– Increased concurrency for CPU-bound workloads
– CPU load target optimized for Power hardware

• More efficient resource utilization for improved performance
– Improved scheduling algorithm for better utilization
– More accurate memory resource estimates for queries
– Update estimates for queued jobs based on runtime statistics
– Adjustment of resource estimates during query execution
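As a purely illustrative sketch of admission based on estimated resource needs, the code below admits a queued job only when its estimated memory and CPU share fit into what is currently free. This is not Db2's scheduling algorithm; the `Job` class, the `admit` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    name: str
    est_memory_gb: float   # predicted memory need (hypothetical estimate)
    est_cpu_share: float   # predicted fraction of CPU capacity


def admit(queue: List[Job], free_memory_gb: float,
          free_cpu_share: float) -> Tuple[List[str], List[str]]:
    """Admit jobs whose estimates fit the remaining resources; queue the rest."""
    admitted, waiting = [], []
    for job in queue:
        fits = (job.est_memory_gb <= free_memory_gb
                and job.est_cpu_share <= free_cpu_share)
        if fits:
            admitted.append(job.name)
            free_memory_gb -= job.est_memory_gb
            free_cpu_share -= job.est_cpu_share
        else:
            waiting.append(job.name)
    return admitted, waiting


queue = [Job("q1", 8, 0.5), Job("q2", 16, 0.25), Job("q3", 4, 0.25)]
admitted, waiting = admit(queue, free_memory_gb=16, free_cpu_share=1.0)
print(admitted, waiting)
```

In a model like this, an overestimated memory need keeps a job waiting longer than necessary and an underestimate overcommits the system, which is why the accuracy of the estimates matters for both queuing and performance.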


Follow our plans using Aha!

We revisit development priorities frequently (e.g. every quarter) in response to customer and market demand/feedback

• As a result: some items move up, some down, some in, and some out.

We have committed to keeping our core roadmaps visible to the public eye using Aha!

• http://ibm.biz/AnalyticsRoadmaps


Try out the new capabilities before they are released! (http://ibm.biz/DB2-EAP)


Leverage the IBM Db2 Developer Community Edition for small PoCs and “trial” production systems (http://Ibm.biz/Db2devc)

Provided as a Docker-based install with Data Server Manager (DSM) and Data Studio included

• Also offered as a native Db2 (only) installation: Db2 Developer-C Community Edition

Highlights:

• Free, fully functional version of Db2

• Includes all Db2 features such as compression, BLU Acceleration

• Supports all Db2 configurations including pureScale and DPF

• You can use it for development, test, or production!

Primary restrictions:

• No support/warranty

• Environment limited to:

– 4 cores with 16GB of memory

– 100GB per database

Stay Current with Db2 Updates

Passive

• Browse the list of available fix packs

– http://www-01.ibm.com/support/docview.wss?rs=71&uid=swg27007053

• Security Vulnerabilities, HIPER and Special Attention APARs fixed in DB2 for Linux, UNIX, and Windows Version 11.1

– https://www-01.ibm.com/support/docview.wss?uid=swg21994955

Proactive

• Go to the IBM Support page and sign up for “My notifications”

– http://www-01.ibm.com/software/support/einfo.html

Questions?
