© 2019 IBM Corporation
Db2 Common SQL Engine Enhancements Update: What's new in Db2 Hybrid cloud and IIAS
Vincent Kulandaisamy
https://www.linkedin.com/in/vincentkulandaisamy
Mar 06 2019
Safe Harbor Statement and Disclaimer
Copyright IBM Corporation 2019. All rights reserved. U.S. Government Users Restricted Rights - use, duplication, or disclosure restricted by GSA ADP Schedule Contract with IBM Corporation.
• IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: ibm.com/legal/copytrade/shtml.
• The information contained in this presentation is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided "as is" without warranty of any kind, expressed or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other documentation.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM (or its suppliers or licensors), or altering the terms and conditions of any agreement or license governing the use of IBM products and/or software.
• Any statements of performance are based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multi-programming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated.
• IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
IBM ©2018 IBM Corporation
Our family of Hybrid Data Management solutions, built on the Db2 common SQL engine
Write your SQL once, deploy against any form factor, run anywhere.
• Db2 on Cloud (cloud): fully managed, cloud transactional data store
• Integrated Analytics System: dedicated analytics appliance
• Db2 & Db2 Warehouse: transactional or analytics SQL database deployed on commodity hardware
• Db2 Big SQL: open source Hadoop with Hortonworks
• Db2 Hosted (cloud): not managed; we install Db2 and hand the keys over to you
• Hosted Analytics with Hortonworks (cloud): hosted Hadoop deployment with Big SQL and Data Science Experience
• Db2 Warehouse on Cloud (cloud): fully managed, cloud data warehouse
The Hybrid DW Strategy: Write once, run anywhere
Offers clients choice in selecting the best (combination of) data stores to satisfy hybrid data warehouse solution needs
Built on a common and fluid analytics SQL engine enabling true hybrid data solutions with portable analytics
Platforms: Db2 Warehouse on Cloud, Integrated Analytics System, Db2 Big SQL, Db2 Warehouse
• Common Skill Set: One skill set for all deployments. Drive higher efficiencies and portfolio rationalization
• Operational Compatibility: Reuse operational and housekeeping procedures
• Next-Gen Analytics: Common programming model for in-DB analytics
• Data Virtualization: Common Fluid Query capabilities for query federation and data movement
• Application Agility: Write once, run anywhere. One ISV product certification for all platforms
Db2: Flexibility of Deployment
Licensing: flexible entitlements for business agility & cost-optimization
[Figure: deployment options laid out from on premises to on cloud and from control to simplicity. Form factors: software, Docker, appliance, private cloud, PaaS, DBaaS (hosted), DBaaS (managed). Products shown: Db2 Warehouse, IBM Integrated Analytics System, Db2 Hosted, Db2 on Cloud, Db2 Warehouse on Cloud]
BLU Acceleration: New Era of Smart
IBM Research & Development Lab Innovations
• Analyze born on the Cloud data sets: Your mobile apps, IoT devices, Web apps are all generating volumes of data. Stick that data in a Cloud data warehouse.
• Offload your analytics: Transactional databases are optimized for transactional workloads. Offload your big data projects to a database optimized for the job.
• Modernize your data warehouse: Transform your data warehouse with a highly performant, in-memory/columnar, in-database analytics and simplify technology processes with data virtualization (federation), scalability and workload portability.
• Extend your on-premises data store to the Cloud: Cloud brings the promise of elasticity, performance, cost savings and business agility. Take advantage of it.
Our Db2 Warehouse Cloud footprint keeps growing…
Here's what some of our public reference customers are doing today:
• Helping machinery operators around the world predict and prevent equipment failure
• Analyzing data from wi-fi access points to enhance tourist recommendations
• Performing advanced analytics on sales data to drive customer engagement
• Training machine learning and deep learning models to personalize shopping experiences
• Analyzing global viewership data to optimize content delivery
• Centralizing operational data to accelerate reporting and business planning
Read more: ibm.com/cloud/db2-warehouse-on-cloud
KONE expects to ease its own maintenance efforts by connecting the KONE global maintenance base of more than 1.1 million elevators and escalators to the IBM Watson technology-driven, cloud-based system that helps predict maintenance needs before equipment can fail. Technicians have more information at their fingertips when they arrive at a service call, which speeds problem resolution and gets equipment back online quicker.
“From a pure technology standpoint, the heart and soul of everything we do at RSG Media is IBM Db2 Warehouse on Cloud. We have benchmarked it against leading competitors such as Snowflake and Amazon Redshift, and Db2 Warehouse on Cloud gives us the best price-performance ratio.”—Shiv Sehgal, CEO, Media Mantra at RSG Media
“Moving to cloud has reduced our on-premises hardware footprint, and we have been able to cut IT operational costs by 35 percent. At the same time, we have escaped the capacity limitations of the on-premises model, and can scale our cloud environment up to run more reports in parallel when we need to.”—Vimal Dev, Vice President – IT, Global Enterprise Applications Leader, Genpact
“The response from IBM was amazing—they were so positive about the idea and immediately looked into how they could best support this initiative. We talked to other AI companies as well, but IBM’s response was the most impressive by far.”—Tanja Hoel, Director, Seafood Innovation Cluster
Here’s what our customers are saying about us…
Think 2019 / DOC ID / February, 2019 / © 2019 IBM Corporation
Performance Improvements in Db2 Columnar Engine (a.k.a. Db2 BLU)
Goals
1. Reduce overall memory footprint and demand on system resources and improve utilization
2. Improve workload concurrency
3. Improve individual query and overall workload performance
Some of the noteworthy enhancements in the following categories were delivered to Db2 Warehouse, Db2 Warehouse on Cloud, IDAA and IIAS:
1. Rewrite Optimizations
2. Optimizer Improvements
3. Runtime Improvements
4. Compression and Data Movement
5. SQL Compatibility
6. WLM
These and many other features will eventually be available in future Db2 on-premises releases.
Rewrite Optimizations
1. Reduce UNION ALL over disjoint branches
2. Transform UNION over DUAL tables to VALUES
3. Perform VALUES Cartesian joins at query rewrite
4. Push down/fold COALESCE/NVL into the VALUES
5. Enforce early DISTINCT (Distinct to Group BY)
6. Eliminate redundant TO_DATE casting functions
1. UNION ALL Reduction over Disjoint Branches
• Transforms UNION ALL over disjoint branches into a simple SELECT with CASE expressions
• Significant compile time reduction
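To make the equivalence behind this rewrite concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine. The `sales` table, its data, and the bucket labels are hypothetical, and this only checks that the two query shapes return the same rows; it does not reproduce the Db2 rewrite itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 5), (2, 50), (3, 500), (4, None)])

# Original shape: UNION ALL over branches whose predicates are disjoint.
union_all_sql = """
    SELECT 'small' AS bucket, id FROM sales WHERE amount < 10
    UNION ALL
    SELECT 'large' AS bucket, id FROM sales WHERE amount >= 10
"""

# Rewritten shape: one scan, with a CASE expression picking the branch constant.
case_sql = """
    SELECT CASE WHEN amount < 10 THEN 'small' ELSE 'large' END AS bucket, id
    FROM sales
    WHERE amount < 10 OR amount >= 10
"""

union_all_rows = sorted(conn.execute(union_all_sql).fetchall())
case_rows = sorted(conn.execute(case_sql).fetchall())
print(union_all_rows == case_rows)
```

The row with a NULL amount is excluded by both forms, which is why the rewritten query keeps the combined predicate in its WHERE clause.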
2. UNION over DUAL tables to VALUES
• Transforms UNION over single-row DUAL tables into a multi-row Values operation
• Significant compile time reduction
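The same idea can be sketched for the DUAL case, again with sqlite3 as a stand-in. SQLite has no DUAL table, so a one-row table named `dual` is created here as an assumption; the codes and labels are made up. Note that UNION removes duplicates, so the comparison below relies on the rows being distinct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: emulate Oracle-style DUAL with a one-row table.
conn.execute("CREATE TABLE dual (dummy TEXT)")
conn.execute("INSERT INTO dual VALUES ('X')")

# Original shape: one single-row SELECT per branch, glued together with UNION.
union_sql = """
    SELECT 1 AS code, 'open' AS label FROM dual
    UNION
    SELECT 2, 'closed' FROM dual
    UNION
    SELECT 3, 'pending' FROM dual
"""

# Rewritten shape: a single multi-row VALUES operation.
values_sql = "SELECT * FROM (VALUES (1, 'open'), (2, 'closed'), (3, 'pending'))"

union_rows = sorted(conn.execute(union_sql).fetchall())
values_rows = sorted(conn.execute(values_sql).fetchall())
print(union_rows == values_rows)
```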
3. VALUES Cartesian Join in Query Rewrite
• Pre-computes Cartesian joins between VALUES boxes in Rewrite
• Avoids multiple NLJNs with the fact table
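A rough way to picture this in plain Python: instead of joining the fact table to each small value list in turn, the value lists are combined once and the fact table is joined a single time. The lists and fact rows below are hypothetical, and the list comprehensions only stand in for join operators.

```python
from itertools import product

regions = ["EU", "US"]   # first VALUES box (hypothetical data)
years = [2018, 2019]     # second VALUES box (hypothetical data)
fact = [("EU", 2018, 10), ("US", 2017, 20), ("APAC", 2019, 30), ("US", 2019, 40)]

# Without the rewrite: the fact table is joined to each VALUES box in turn.
two_joins = [row for row in fact if any(row[0] == r for r in regions)]
two_joins = [row for row in two_joins if any(row[1] == y for y in years)]

# With the rewrite: the Cartesian product is computed once, up front,
# and the fact table is joined once against the combined box.
combined = set(product(regions, years))
one_join = [row for row in fact if (row[0], row[1]) in combined]

print(one_join == two_joins)
```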
4. Push COALESCE/NVL into the VALUES box
• Pushes down and folds COALESCE/NVL scalar function into the VALUES box
• Without a COALESCE pushed down, we could get a NLJN instead of HSJN if the Values box and the Coalesce box were split in the plan.
• Improves selectivity estimates when COALESCE was referenced in predicate
5. Enforce Early Distinct/Distinct to Group By
• Keeps the DISTINCT in place rather than letting the rewrite engine pull it up in the graph to be merged into other operations; to favor CSEs, replaces DISTINCT with a GROUP BY operator
• Significant benefits by removing duplicates early
• Avoids manual rewrites in the queries and views
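The DISTINCT to GROUP BY replacement relies on the two forms returning the same rows. A small check of that equivalence, using sqlite3 as a stand-in engine and a hypothetical `visits` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (store_id INTEGER, city TEXT)")  # hypothetical table
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [(1, "Oslo"), (1, "Oslo"), (2, "Rome"), (2, "Rome"), (3, "Lima")])

# DISTINCT form, as a user or view might write it.
distinct_rows = sorted(
    conn.execute("SELECT DISTINCT store_id, city FROM visits").fetchall())

# Equivalent GROUP BY form over the same columns.
group_by_rows = sorted(
    conn.execute("SELECT store_id, city FROM visits GROUP BY store_id, city").fetchall())

print(distinct_rows == group_by_rows)
```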
6. Remove Redundant TO_DATE Casting Function
• TO_DATE is redundant when applied on a DATE type column
• Depending on the context (e.g. predicates, select list), the function can be replaced by the source column or TIMESTAMP()
• Provides better selectivity estimates when removed from predicates
• Reduction in execution time as TO_DATE is expensive
Optimizer Improvements
1. Smart statistics collection on MPP
2. Local sort for insert-subselect with order by
1. Smarter Statistics Collection on MPP
Old behavior
- Table statistics extrapolated from the first partition in the table's partition group
New behavior
- Table statistics extrapolated from the first non-empty table partition
- Improves robustness and performance through better query optimizer planning
[Figure: MyTable across three partitions. Partition 1: 0 rows; Partition 2: 0 rows; Partition 3: 12 rows]
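The effect of the change on the three-partition example can be sketched as follows. The extrapolation rule used here (sampled row count multiplied by the number of partitions) is an assumption made for illustration; the slides do not state the exact formula, and `estimate_cardinality` is a hypothetical helper.

```python
def estimate_cardinality(partition_row_counts, skip_empty):
    """Extrapolate a table row count from a single sampled partition.

    Assumption: the estimate is the sampled partition's row count
    multiplied by the number of partitions.
    """
    n_partitions = len(partition_row_counts)
    for rows in partition_row_counts:
        # Old behavior samples the first partition unconditionally;
        # new behavior keeps looking until it finds a non-empty one.
        if rows > 0 or not skip_empty:
            return rows * n_partitions
    return 0  # every partition is empty (or there are none)


partitions = [0, 0, 12]  # row counts from the slide's example
old_estimate = estimate_cardinality(partitions, skip_empty=False)
new_estimate = estimate_cardinality(partitions, skip_empty=True)
print(old_estimate, new_estimate)
```

With the old rule the optimizer would plan as if the table were empty, while the new rule yields a non-zero estimate.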
2. Local Sort for Insert-Subselect with Order-By
High Level
- Identify intra-partition data movement
- Avoid needless processing by coordinator
- Avoid FCM and TQ processing
Applicability
- CTAS or Insert From Select scenarios with Order By
- Both source and target data must be on the same partition
Value Proposition
- Significantly improves table sorting and ETL operations
2. Local Sort for Insert-Subselect with Order-By (Cont’d)
[Figure: without local sort, rows flow from the Source Data Nodes through the Coordinator Node to the Target Data Nodes; with local sort, rows flow directly from the Source Data Nodes to the Target Data Nodes]
• During an Insert-Subselect with Order-by, the sorting for the order-by can be done locally without sending rows to the coordinator node.
• E.g. insert into t1 select * from t2 order by c1;
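A simplified simulation of the two data flows, assuming source and target share the same distribution so that every row stays on its own partition. The `partition_of` function, the row layout, and the "rows moved" counter are hypothetical and only illustrate why the coordinator hop is unnecessary.

```python
def partition_of(row, n_partitions):
    # Hypothetical distribution function shared by source and target tables.
    return row["id"] % n_partitions


def insert_via_coordinator(source, sort_key):
    """Gather all rows on the coordinator, sort, then redistribute."""
    n = len(source)
    gathered = [row for part in source for row in part]
    gathered.sort(key=sort_key)
    target = [[] for _ in range(n)]
    for row in gathered:
        target[partition_of(row, n)].append(row)
    return target, len(gathered)  # every row travelled through the coordinator


def insert_with_local_sort(source, sort_key):
    """Sort each partition in place; no row leaves its partition."""
    target = [sorted(part, key=sort_key) for part in source]
    return target, 0


# Two partitions: even ids on partition 0, odd ids on partition 1.
t2 = [
    [{"id": 2, "c1": 30}, {"id": 4, "c1": 10}],
    [{"id": 1, "c1": 40}, {"id": 3, "c1": 20}],
]
by_c1 = lambda row: row["c1"]
coord_target, coord_moved = insert_via_coordinator(t2, by_c1)
local_target, local_moved = insert_with_local_sort(t2, by_c1)
print(coord_target == local_target, coord_moved, local_moved)
```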
Runtime Improvements
1. Efficient sorting of skewed data
2. Improved memory utilization on GROUP BY with skewed data
3. Improved memory utilization on aggregation distinct with wide columns
4. Hash join Residual Predicates support
5. Efficient system resources utilization on complex queries
1. Efficient Sorting of Skewed Data
• Depending on data characteristics, heavily skewed data might create data chunks that have majority duplicate and minority distinct values
New enhancement: Efficient sorting with significantly less memory
• Identifies majority duplicate value and skips sorting duplicates
• Sorts only minority distinct values
• Emits the values in the right sorted order
• Improved SQL ORDER BY performance
[Figure: a data chunk split into a duplicate value chunk and a minority distinct values chunk]
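The three steps above can be sketched in a few lines of Python. This is an illustration of the idea only, not the engine's implementation; `sort_skewed_chunk` is a hypothetical helper that treats the most frequent value in the chunk as the majority duplicate.

```python
from collections import Counter


def sort_skewed_chunk(chunk):
    """Sort a chunk dominated by one repeated value without sorting the repeats."""
    if not chunk:
        return []
    # Step 1: identify the majority duplicate value and how often it occurs.
    majority, dup_count = Counter(chunk).most_common(1)[0]
    # Step 2: sort only the minority values.
    minority = sorted(v for v in chunk if v != majority)
    # Step 3: emit in order, splicing the duplicates in at the right position.
    smaller = [v for v in minority if v < majority]
    larger = minority[len(smaller):]
    return smaller + [majority] * dup_count + larger


chunk = [7] * 10 + [3, 9, 1, 8]
result = sort_skewed_chunk(chunk)
print(result)
```

Only four of the fourteen values go through the sort in this example; the ten duplicates are emitted directly.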
2. Improved Memory Utilization on GROUP BY
• Depending on data characteristics, heavily skewed data might create highly imbalanced distribution across the chunks when partitioning the data during GROUP BY processing
– leads to higher memory footprint during grouping and aggregation
New enhancement
• Improved parallelism on grouping/aggregation and spilling data chunks when available memory is constrained
• Significantly reduces memory footprint and improves performance
3. Improved Memory Utilization on Aggregation Distinct
• Aggregation distinct on wider columns significantly increases memory footprint and affects query performance
• Example:
  create table T (store_id BIGINT, prodname VARCHAR(2000), storename VARCHAR(5000)…) organize by column;
  select store_id, count(distinct prodname), count(distinct storename) from T group by store_id;
• Actual width of the tuples may be significantly shorter than the schema column width
New enhancement
• Varying length values are stored in compact form during aggregation distinct processing
• Significant memory reduction
• Improved aggregation distinct performance when actual width is smaller than schema column width
• New enhancement is available under a registry knob
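A back-of-the-envelope comparison shows why compact storage matters for the example table. The two-byte length prefix and the idea that the uncompacted form reserves the full declared width are assumptions for illustration; the slides do not describe the internal layout.

```python
# Declared VARCHAR widths from the example table definition.
SCHEMA_WIDTH = {"prodname": 2000, "storename": 5000}


def padded_bytes(rows):
    """Memory if every value reserved its full declared width (assumption)."""
    return len(rows) * sum(SCHEMA_WIDTH.values())


def compact_bytes(rows, length_prefix=2):
    """Memory if each value stored only its actual bytes plus a length prefix."""
    return sum(len(row[col].encode("utf-8")) + length_prefix
               for row in rows for col in SCHEMA_WIDTH)


# Hypothetical rows with short actual values.
rows = [
    {"prodname": "espresso", "storename": "Main Street"},
    {"prodname": "tea", "storename": "Harbor"},
]
print(padded_bytes(rows), compact_bytes(rows))
```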
4. Hash Join Residual Predicates Support
• What are join residual predicates?
• Predicates applied against non join key column(s)
• For outer and anti-joins, predicates must reference row preserving side
• For inner joins, predicates must reference both sides of the join tables
• Can be simple equality, range predicates or expressions
4. Hash Join Residual Predicates Support (Cont’d)
Join types: INNER JOIN (IJ), LEFT OUTER JOIN (LOJ), RIGHT OUTER JOIN (ROJ), LEFT ANTI JOIN (LAJ), RIGHT ANTI JOIN (RAJ)
Necessary conditions:
- IJ: Must reference both the fact and dimension sides.
- OJ/AJ: Must reference at least the row preserving side.
- LOJ/LAJ: Fact side is row preserving
- ROJ/RAJ: Dimension side is row preserving
Equality JOIN predicates, Residual predicates and Local predicates together return the results for a given JOIN
select * from t1 left outer join t2
  on t1.c1 = t2.c1                  -- JOIN_PRED
  and t1.c2 <= 5                    -- RESID_PRED
  and t1.c3 LIKE '%mario%'          -- RESID_PRED
  and t2.c4 IN (10, 100, 1000)      -- LOCAL PRED
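How the three predicate kinds interact in a left outer hash join can be sketched as below. This is a textbook-style illustration, not Db2's runtime: the local predicate filters the non-preserved side before the hash table is built, the equality predicate drives the hash lookup, and the residual predicate is checked per candidate pair. Rows from the row-preserving side that find no qualifying match are still emitted with a NULL partner. The function name and sample rows are hypothetical.

```python
from collections import defaultdict


def left_outer_hash_join(t1, t2, local_pred, resid_pred):
    # Build phase: hash t2 on the join key, keeping only rows that pass the
    # local predicate. NULL keys never match, so they are not inserted.
    build = defaultdict(list)
    for r2 in t2:
        if r2["c1"] is not None and local_pred(r2):
            build[r2["c1"]].append(r2)

    # Probe phase: t1 is the row-preserving side.
    out = []
    for r1 in t1:
        matches = [r2 for r2 in build.get(r1["c1"], []) if resid_pred(r1, r2)]
        if matches:
            out.extend((r1, r2) for r2 in matches)
        else:
            out.append((r1, None))  # preserved row, no qualifying partner
    return out


t1 = [
    {"c1": 1, "c2": 3, "c3": "super mario"},
    {"c1": 1, "c2": 9, "c3": "mario kart"},   # fails t1.c2 <= 5
    {"c1": 2, "c2": 1, "c3": "zelda"},        # fails LIKE '%mario%'
]
t2 = [{"c1": 1, "c4": 10}, {"c1": 1, "c4": 7}, {"c1": 2, "c4": 100}]

joined = left_outer_hash_join(
    t1, t2,
    local_pred=lambda r2: r2["c4"] in (10, 100, 1000),
    resid_pred=lambda r1, r2: r1["c2"] <= 5 and "mario" in r1["c3"],
)
for left, right in joined:
    print(left, right)
```

All three t1 rows appear in the output, but only the first one carries a t2 partner.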
4. Hash Join Residual Predicates Support (Cont’d)
Supported Predicates
• LIKE, IN, ISNULL, ISNOTNULL, EQ, NEQ, GEQ, LEQ, LT, GT, BETWEEN, OVERLAPS
Residual predicates with expressions:
• Implicit casts: e.g., t1.c1 < t2.c1 // t1.c1 and t2.c1 are different comparable data types
• Explicit casts: e.g., CAST (t1.c1 as BIGINT) < t2.c1
• Scalar functions: e.g., MOD(t1.c3, 2) < t2.c3
• Arithmetic expressions: e.g., t1.c1 + 5 > t2.c1
• Any combination of above
Not supported:
• Predicates where a single operand references both sides of JOIN. e.g., t1.c1 + t2.c1 <resid_pred> op2
4. Hash Join Residual Predicates Performance
NOTE: Performance results based on internal targeted workloads. Individual mileage may vary
5. Efficient Resource Utilization on Complex Queries
• Efficient reuse/repurpose of the Db2 system agents and resources for executing different query blocks / subsections of a complex query plan
• Significantly reduces number of active agents and system resource utilization
Example 1
select 't1' as tablename, count(*) from t1 where c1 = 0
union all
select 't2' as tablename, count(*) from t2 where c1 = 0
union all
select 't3' as tablename, count(*) from t3 where c1 = 0
union all
select 't4' as tablename, count(*) from t4 where c1 = 0
5. Efficient Resource Utilization on Complex Queries (Cont’d)
Example 2
select col1, count(distinct col3), count(distinct col4), count(distinct col6),
       count(distinct col7), count(distinct col8), count(distinct col9),
       count(distinct col10), count(distinct col11), count(distinct col12)
from ua100k group by col1
Compression & Data Movement
Compression Enhancements
1. Enhancements to SQL-based insert and update statements
• Significant optimizations to process large volume of data more efficiently and faster
• Optimized and improved compression
2. REORG TABLE enhancement to further improve compression
• Recompress the initial data inserted before the creation of the dictionary
• After dictionary is built, initial data is recompressed with new dictionary
• Completely automated so compression should improve with no user intervention
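The recompression step can be pictured with a toy dictionary encoder. This is only a sketch of the general idea: rows that arrive before a dictionary exists are stored un-encoded, and once the dictionary is built they are rewritten using it. The helper names and the integer-code dictionary are assumptions, not Db2's actual compression format.

```python
def build_dictionary(values):
    """Assign a small integer code to each distinct value (toy dictionary)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace each value with its code; values not in the dictionary stay raw."""
    return [dictionary.get(value, value) for value in values]


# Initial rows are inserted before any dictionary exists, so they stay raw.
initial_rows = ["NY", "CA", "NY", "TX"]
stored_initial = list(initial_rows)

# The dictionary is built from the data seen so far; later rows are encoded.
dictionary = build_dictionary(initial_rows)
later_rows = encode(["CA", "CA", "TX"], dictionary)

# The enhancement: go back and recompress the initial rows with the dictionary.
recompressed_initial = encode(stored_initial, dictionary)
print(stored_initial, recompressed_initial, later_rows)
```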
[Figure: un-encoded data and encoded data before REORG TABLE; encoded data after recompression]
Bulk Insert & Update Performance
• SMP parallelism: parallelized the INSERT processing to use multiple cores/threads
• Vectorization of the runtime of BLU bulk insert
• Log reservation
• Encoding improvement
• Code path optimization
• Bulk insert is around 4X faster compared to Db2 v11.1!
Reduced Logging
Reduced logging improvements:
• Consists of two parts: reduced undo logging and reduced redo logging.
• Both parts share the following:
  - Applies only to column-organized tables.
  - Applies to any bulk load operation that drives insert internally (e.g. insert from subselect, update, merge).
Reduced undo logging improvements:
• Avoids undo processing and logging for data page contents.
• No new functional restrictions/limitations.
• Results in significantly less active log space needed for bulk insert (~40% less in our testing).
Reduced redo logging improvements:
• Logs metadata changes, but skips logging of page contents.
• Similar to NLI tables, but with improved recoverability and concurrency.
Table contents will be preserved during:
• Rollback
• Crash recovery
• Recovery to the time of the last backup
Table contents are preserved to end of backup for any online backup.
Total impact: 95% reduction in required log space
External Tables
A simple mechanism to treat external "files" as a database table using SQL statements
• Can also be used to load from or unload to external files
• Can be used to define a permanent external table or directly within a SQL statement
• Currently supports GZIP and uncompressed file formats
• Sources can be local/remote sources including object stores Swift and AWS S3
Example:
  create external table ext_orders(order_num INT, order_dt TIMESTAMP)
    USING (dataobject('/tmp/order.tbl') DELIMITER '|');
  insert into orders (select * from ext_orders where order_num > 10);
Example:
  insert into orders (select * from external '/tmp/orders.txt' using (REMOTESOURCE GZIP delimiter ','));
Example: Unloading from a base table to an external table
  insert into ext_orders select * from source_table
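For readers without a Db2 system at hand, the pattern in the first example (read a delimited, optionally gzip-compressed file as rows, filter, and insert into a target table) can be imitated with the Python standard library. This is an analogy only: it uses SQLite and in-memory data, the sample rows are made up, and none of it reflects how Db2 implements external tables.

```python
import csv
import gzip
import io
import sqlite3

# Hypothetical pipe-delimited order data, gzip-compressed in memory.
raw = ("5|2019-01-01 10:00:00\n"
       "11|2019-01-02 11:30:00\n"
       "42|2019-01-03 09:15:00\n")
compressed = gzip.compress(raw.encode("utf-8"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_num INTEGER, order_dt TEXT)")

# Treat the compressed file as a row source, apply the filter while reading
# (order_num > 10), and insert only the qualifying rows.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="|")
    rows = [(int(num), dt) for num, dt in reader if int(num) > 10]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

loaded = conn.execute("SELECT order_num FROM orders ORDER BY order_num").fetchall()
print(loaded)
```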
External Table Advantages
• Inserts into target table are logged, unlike Db2 Load
• Constraint validation, unlike Db2 Load
• Complex expressions on data being loaded, as compared to Db2 Ingest or Db2 Load: ETL capabilities
• Does not require a Z-lock on target table
• Allows a source data file to be joined with another table before loading into target table
• Load selective rows by applying filters
• Load remote data files without any staging space – Remote streaming
• Ability to load compressed files directly and from heterogeneous data sources
• Easier to integrate with external application
• Fast data load due to parallel formatters and parallel insert
• Easy diagnostics using generated Log & Bad file
Dramatic Speedup on Data Ingestion
Combined techniques lead to dramatic performance gains:
1. SQL INSERT based bulk loading
2. Reduced Logging
3. New faster data parser, with External Tables syntax
4. Vectorized Inserts
5. Synopsis table Inserts in-memory
6. Parallel Insert (SMP & MPP combined)
Achieve dramatic improvements in data ingestion:
• 4x faster than the previous generation of the Db2 LOAD utility
• Load data completely online with External Tables while running SQL queries, deletes or updates
• Run multiple bulk loads against the same table simultaneously
[Chart: Bulk Ingest Rates, load rate in TB/hr (higher is better), showing a 4X improvement]
SQL Language Compatibility
Language Support
DB2 SQL & SQL PL
Oracle SQL & PL/SQL
Netezza SQL, PL/SQL, UDX
PostgreSQL SQL
R
Spark*, Scala, PySpark
JSON JAVA API
Python
JDBC/ODBC
Significant Language Extensions
• New data types: BINARY, VARBINARY and BOOLEAN
• Scalar Functions: HASH, HASH4, HASH8 and many other new scalar functions
• Regular expression scalar functions (highly sought after functions) - REGEXP_*
• Native LOB support
And many others…
- LIMIT and OFFSET clause
- ORDER BY clause: NULLs can be sorted FIRST or LAST in either ascending (ASC) or descending (DESC) order
- OLAP specification extensions: NTH_VALUE, CUME_DIST, PERCENT_RANK
- DISTINCT support on LISTAGG aggregate function
- Oracle outer join syntax (+) no longer requires the compatibility switch (an often-used stand-alone feature)
- OVERLAPS predicate: determines whether two chronological periods overlap. A chronological period is specified by a pair of date-time expressions (the first expression specifies the start of a period; the second specifies its end)
- CREATE FUNCTION AGGREGATE: extend the database with user-defined aggregation
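The core test behind an OVERLAPS-style predicate can be written in one line. The sketch below assumes each period has its start before its end and treats periods as half-open, so two periods that merely touch do not overlap; how Db2 handles edge cases such as reversed or zero-length periods is not covered here, and `periods_overlap` is a hypothetical helper.

```python
from datetime import date


def periods_overlap(start1, end1, start2, end2):
    """Return True when two (start, end) periods share at least one instant.

    Assumption: start < end for both periods, and the end is exclusive.
    """
    return start1 < end2 and start2 < end1


january = (date(2019, 1, 1), date(2019, 1, 31))
mid_jan_to_mid_feb = (date(2019, 1, 15), date(2019, 2, 15))
february = (date(2019, 1, 31), date(2019, 2, 28))

print(periods_overlap(*january, *mid_jan_to_mid_feb))
print(periods_overlap(*january, *february))
```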
Adaptive Workload Management
• Intelligent Job Scheduling
  - Cost evaluation includes memory & CPU load & time duration of queries
  - Includes historical feedback based on past executions
  - Scheduling based on dynamic view of resource availability in each "lane"
  - Accuracy of predicted resource consumption determines queuing and performance
• Benefits
  - Improved robustness under high load
  - Improved SLA achievement
  - Improved overall resource efficiency & throughput
• Enhancements
  - Increased concurrency for CPU bound workloads
  - CPU load target optimized for Power hardware
• More efficient resource utilization for improved performance
  - Improved scheduling algorithm for better utilization
  - More accurate memory resource estimates for queries
  - Update estimates for queued jobs based on runtime statistics
  - Adjustment of resource estimates during query execution
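To give a feel for estimate-driven admission, here is a deliberately simple sketch: a job is admitted only if its estimated memory and CPU fit into what is currently free, otherwise it stays queued. The job fields, the units, and the first-fit policy are all assumptions for illustration and are not the scheduling algorithm Db2 uses.

```python
def admit_jobs(queue, free_memory_mb, free_cpu):
    """Admit queued jobs whose resource estimates fit the remaining capacity."""
    admitted, still_queued = [], []
    for job in queue:
        fits = (job["est_memory_mb"] <= free_memory_mb
                and job["est_cpu"] <= free_cpu)
        if fits:
            admitted.append(job["name"])
            free_memory_mb -= job["est_memory_mb"]
            free_cpu -= job["est_cpu"]
        else:
            still_queued.append(job["name"])
    return admitted, still_queued


# Hypothetical queue with per-job resource estimates.
queue = [
    {"name": "daily_report", "est_memory_mb": 4000, "est_cpu": 4},
    {"name": "big_etl", "est_memory_mb": 9000, "est_cpu": 8},
    {"name": "dashboard", "est_memory_mb": 1000, "est_cpu": 2},
]
admitted, still_queued = admit_jobs(queue, free_memory_mb=8000, free_cpu=8)
print(admitted, still_queued)
```

The sketch also shows why estimate accuracy matters: an overestimate for `big_etl` would keep it queued even when it could have run.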
Follow our plans using Aha!
We revisit development priorities frequently (e.g. every quarter) in response to customer and market demand/feedback
• As a result: some items move up, some down, some in, and some out.
We have committed to keeping our core roadmaps visible to the public eye using Aha!
• http://ibm.biz/AnalyticsRoadmaps
Try out the new capabilities before they are released! (http://ibm.biz/DB2-EAP)
Leverage the IBM Db2 Developer Community Edition for small PoCs and “trial” production systems (http://Ibm.biz/Db2devc)
Provided as a docker-based install with Data Server Manager (DSM) and Data Studio included
• Also offered as a native Db2 (only) installation: Db2 Developer-C Community Edition
Highlights:
• Free, fully-functional version of Db2
• Includes all Db2 features such as compression, BLU Acceleration
• Supports all Db2 configurations including pureScale and DPF
• You can use it for development, test, or production !
Primary restrictions:
• No support/warranty
• Environment limited to:
– 4 cores with 16GB of memory
– 100GB per database
Stay Current with Db2 Updates
Passive
• Browse the list of available fix packs
– http://www-01.ibm.com/support/docview.wss?rs=71&uid=swg27007053
• Security Vulnerabilities, HIPER and Special Attention APARs fixed in DB2 for Linux, UNIX, and Windows Version 11.1
– https://www-01.ibm.com/support/docview.wss?uid=swg21994955
Proactive
• Go to the IBM Support page and sign up for “My notifications”
– http://www-01.ibm.com/software/support/einfo.html
Questions?