2
Extending Database Accelerators for Data Transformations and Predictive Analytics Knut Stolze IBM Germany Research & Development GmbH Schönaicher Straße 220 71032 Böblingen, Germany [email protected] Felix Beier IBM Germany Research & Development GmbH Schönaicher Straße 220 71032 Böblingen, Germany [email protected] Daniel Martin IBM Germany Research & Development GmbH Schönaicher Straße 220 71032 Böblingen, Germany [email protected] ABSTRACT The IBM DB2 Analytics Accelerator (IDAA) integrates the strong OLTP capabilities of DB2 for z/OS with very fast processing of OLAP workloads using Netezza technology. The accelerator is attached to DB2 as analytical process- ing resource – completely transparent for user applications. But all data modifications must be carried out by DB2 and are replicated to the accelerator internally. However, this behavior is not optimized for ELT processing and predic- tive analytics or data mining workloads where multi-staged data transformations are involved. We present our work for extending IDAA with accelerator-only tables, which enable direct data transformations without any necessary interven- tions by DB2. Further, we present a framework for executing arbitrary in-database analytics operations on the accelerator while ensuring data governance aspects like privilege man- agement on DB2 and allowing to ingest data from any other source directly to the accelerator to enrich analytics e. g., with social media data. The evolutionary framework design maintains compatibility with existing infrastructure and ap- plications, a must-have for the majority of customers, while allowing complex analytics beyond read-only reporting. Keywords analytics, data mining, db2, mainframe, idaa 1. INTRODUCTION The IBM DB2 Analytics Accelerator (IDAA) [1] is an ex- tension for IBM’s R DB2 R for z/OS R database system. It’s primary objective is the extremely fast execution of com- plex, analytical queries on a snapshot of the data copied from DB2. However, when it comes to more complex, multi- staged data analysis tasks like data mining, the accelerator can often provide limited improvements only. Predictive an- alytics tools like SPSS [4] resort to multiple SQL statements, each implementing a step or stage in a chain of data prepara- tion, transformation, and evaluation tasks. For each stage, c 2016, Copyright is with the authors. Published in Proc. 19th Inter- national Conference on Extending Database Technology (EDBT), March 15-18, 2016 - Bordeaux, France: ISBN 978-3-89318-070-7, on OpenPro- ceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0 base data needs to be transferred to IDAA before mining algorithms can be run and result data has to be materi- alized within DB2 before it can be used as input for the next stage or iteration. A key requirement for enhancing these workloads is to minimize data movement while still exploiting the accelerator for this task. We solve this issue with accelerator-only tables (AOTs) which are discussed in Sec. 2. The second use case is the application of the an- alytic algorithms in the pipeline. A generic framework is required which allows to pass code for arbitrary algorithms to IDAA while still implementing data governance aspects correctly. A seamless approach, completely transparent to user applications is discussed in Sec. 3. 2. ACCELERATOR-ONLY TABLES AND DATA INGESTION Maintaining a copy of the DB2 data in the accelerator for query processing is the main use case of IDAA. However, this design involves a lot of data movements in case of multi- staged algorithms that require result materialization inside DB2 before the next step can be executed. The first build- ing block in our current efforts are accelerator-only tables (AOTs), i. e., tables whose data solely resides inside IDAA (cf. Fig. 1), and DB2 only keeps a proxy or table reference which is usually named nickname in federation contexts [5]. This proxy is used for storing meta data in the DB2 catalog and acts as indicator for delegating any query on the cor- responding AOT to IDAA. For creating AOTs the CREATE TABLE statement was extended by an additional IN ACCEL- ERATOR clause. AOTs are populated with INSERT statements comprising a list of values or a sub-select which might in- voke arbitrary transformation procedures (cf. Sec. 3), re- trieving the data from other regular accelerated tables or AOTs. Likewise, UPDATEs and DELETEs are handled. The second way for populating AOTs is the new IDAA Loader [2] (cf. Fig. 1). The data to be loaded can originate from a variety of sources, even from applications not running on System z which opens up a wide range of new use cases. Data can be ingested in both, regular DB2 tables and AOTs. In the past, IDAA was not concerned about transactions be- cause only the cursor stability isolation level was supported. Queries were executed under snapshot isolation in Netezza. With AOTs, IDAA has to be aware of the DB2 transaction context so that correct results are guaranteed, i. e., uncom- mitted data modifications of the own transaction are han- dled. At the same time, concurrent execution of multiple queries in a single transaction are also supported. Poster Paper Series ISSN: 2367-2005 706 10.5441/002/edbt.2016.97

Extending Database Accelerators for Data Transformations ...openproceedings.org/2016/conf/edbt/paper-339.pdf · Extending Database Accelerators for Data Transformations ... and DB2

Embed Size (px)

Citation preview

Page 1: Extending Database Accelerators for Data Transformations ...openproceedings.org/2016/conf/edbt/paper-339.pdf · Extending Database Accelerators for Data Transformations ... and DB2

Extending Database Accelerators for Data Transformationsand Predictive Analytics

Knut StolzeIBM Germany Research &

Development GmbHSchönaicher Straße 220

71032 Böblingen, [email protected]

Felix BeierIBM Germany Research &

Development GmbHSchönaicher Straße 220

71032 Böblingen, [email protected]

Daniel MartinIBM Germany Research &

Development GmbHSchönaicher Straße 220

71032 Böblingen, [email protected]

ABSTRACTThe IBM DB2 Analytics Accelerator (IDAA) integrates thestrong OLTP capabilities of DB2 for z/OS with very fastprocessing of OLAP workloads using Netezza technology.The accelerator is attached to DB2 as analytical process-ing resource – completely transparent for user applications.But all data modifications must be carried out by DB2 andare replicated to the accelerator internally. However, thisbehavior is not optimized for ELT processing and predic-tive analytics or data mining workloads where multi-stageddata transformations are involved. We present our work forextending IDAA with accelerator-only tables, which enabledirect data transformations without any necessary interven-tions by DB2. Further, we present a framework for executingarbitrary in-database analytics operations on the acceleratorwhile ensuring data governance aspects like privilege man-agement on DB2 and allowing to ingest data from any othersource directly to the accelerator to enrich analytics e. g.,with social media data. The evolutionary framework designmaintains compatibility with existing infrastructure and ap-plications, a must-have for the majority of customers, whileallowing complex analytics beyond read-only reporting.

Keywordsanalytics, data mining, db2, mainframe, idaa

1. INTRODUCTIONThe IBM DB2 Analytics Accelerator (IDAA) [1] is an ex-

tension for IBM’sR© DB2R© for z/OSR© database system. It’sprimary objective is the extremely fast execution of com-plex, analytical queries on a snapshot of the data copiedfrom DB2. However, when it comes to more complex, multi-staged data analysis tasks like data mining, the acceleratorcan often provide limited improvements only. Predictive an-alytics tools like SPSS [4] resort to multiple SQL statements,each implementing a step or stage in a chain of data prepara-tion, transformation, and evaluation tasks. For each stage,

c©2016, Copyright is with the authors. Published in Proc. 19th Inter-national Conference on Extending Database Technology (EDBT), March15-18, 2016 - Bordeaux, France: ISBN 978-3-89318-070-7, on OpenPro-ceedings.org. Distribution of this paper is permitted under the terms of theCreative Commons license CC-by-nc-nd 4.0

base data needs to be transferred to IDAA before miningalgorithms can be run and result data has to be materi-alized within DB2 before it can be used as input for thenext stage or iteration. A key requirement for enhancingthese workloads is to minimize data movement while stillexploiting the accelerator for this task. We solve this issuewith accelerator-only tables (AOTs) which are discussed inSec. 2. The second use case is the application of the an-alytic algorithms in the pipeline. A generic framework isrequired which allows to pass code for arbitrary algorithmsto IDAA while still implementing data governance aspectscorrectly. A seamless approach, completely transparent touser applications is discussed in Sec. 3.

2. ACCELERATOR-ONLY TABLES ANDDATA INGESTION

Maintaining a copy of the DB2 data in the accelerator forquery processing is the main use case of IDAA. However,this design involves a lot of data movements in case of multi-staged algorithms that require result materialization insideDB2 before the next step can be executed. The first build-ing block in our current efforts are accelerator-only tables(AOTs), i. e., tables whose data solely resides inside IDAA(cf. Fig. 1), and DB2 only keeps a proxy or table referencewhich is usually named nickname in federation contexts [5].This proxy is used for storing meta data in the DB2 catalogand acts as indicator for delegating any query on the cor-responding AOT to IDAA. For creating AOTs the CREATE

TABLE statement was extended by an additional IN ACCEL-

ERATOR clause. AOTs are populated with INSERT statementscomprising a list of values or a sub-select which might in-voke arbitrary transformation procedures (cf. Sec. 3), re-trieving the data from other regular accelerated tables orAOTs. Likewise, UPDATEs and DELETEs are handled. Thesecond way for populating AOTs is the new IDAA Loader[2] (cf. Fig. 1). The data to be loaded can originate froma variety of sources, even from applications not running onSystem z which opens up a wide range of new use cases.Data can be ingested in both, regular DB2 tables and AOTs.In the past, IDAA was not concerned about transactions be-cause only the cursor stability isolation level was supported.Queries were executed under snapshot isolation in Netezza.With AOTs, IDAA has to be aware of the DB2 transactioncontext so that correct results are guaranteed, i. e., uncom-mitted data modifications of the own transaction are han-dled. At the same time, concurrent execution of multiplequeries in a single transaction are also supported.

Poster Paper

Series ISSN: 2367-2005 706 10.5441/002/edbt.2016.97

Page 2: Extending Database Accelerators for Data Transformations ...openproceedings.org/2016/conf/edbt/paper-339.pdf · Extending Database Accelerators for Data Transformations ... and DB2

DB2 z/OS IDAA

Figure 6: SPSS Modeler using Accelerator-Only Tables

wrapper stored procedure in DB2, which (when called) connects to IDAA, which in turncalls the Netezza stored procedure.

In our prototype, we have implemented 5 such stored procedures. The Netezza Analyticspackage has many more, and most of them will finally be made available once this featurebecomes part of a future version of the IBM DB2 Analytics Accelerator.

4.1 Evaluation

In order to test the validity of the integration of analytics functionality into IDAA, weconducted some basic performance tests. In one scenario (A), SPSS was installed on ona zLinux partition on System z and connected to DB2 for z/OS through hyperlink sock-ets. This approach is already very much optimized and minimizes any possible networklatency between SPSS and DB2. We have to point out that in such a (today very typical)setup, SPSS tries to push down as many predicates and joins to the data source (DB2) aspossible, but it will compensate joins, filtering, etc. if necessary (e. g. when merging datafrom different data sources). If such a compensation happens, SPSS transfers all relevantdata from the data source to its own memory and computes the result locally. The ana-lytics algorithm (k-means in our example) runs in SPSS – not in DB2 or in DB2 storedprocedures.

The second setup (B) is the important scenario for our in-accelerator analytics processing.All steps, including the execution of the k-means clustering, are executed in the accelerator.

SPSS Modeler

imagecopy

VSAMIMS

Oracle

External Data Sources

IDAA Loader Regular

Table

Table Proxy

Table Proxy

RegularTable

AOT

AOT

Table Proxy AOT

tablereference

tablecopy

insertupdatedelete

(+storedprocedure)

dataingestion

AOTgeneration

Figure 1: Overview IDAA with Accelerator-Only Tables

3. IN-ACCELERATOR ANALYTICSIDAA has now a framework for invoking generic, customer-

specific stored procedures (SPs). A SP can be called inDB2z, and IDAA forwards the procedure execution to theNetezza backend (cf. Fig. 2 and Fig. 3). We integrated theNetezza analytics package [3] which is used by SPSS [4].The SPSS modeler provides means to define nodes, to cre-ate and populate AOTs, e. g., by joining accelerated tables,and then to invoke an analytics SP running in IDAA. Sucha scenario is illustrated in Fig. 1 where the k-means clus-tering algorithm is applied on some filtered, enriched, andsampled input data. Intermediate results are materializedinto AOTs and finally fed to the k-means SP. However, twochallenges occur here. First, user privileges need to be veri-fied for all data sources referenced by such a black box SP.Therefore, IDAA resolves all dependencies without execut-ing the SP code. These table references are passed to DB2which in turn validates necessary user privileges before theactual operation is carried out. The second issue arises fromviews existing on the DB2 side, which are not present on theaccelerator. The framework exploits DB2’s query accelera-tion capabilities to extract the view definition and implicitlytranslate it to the Netezza SQL dialect. A temporary viewis created in the Netezza backend using that definition.

4. RELATED WORKUsing accelerator-only tables is similar to the concept of

pass-through functionality in federated systems based onSQL/MED [5]. However, the overall use case for AOTs isquite different. AOTs are conceptually DB2 tables and canbe manipulated using DB2’s SQL dialect. It is just thatAOTs store all their data in the analytics-optimized queryengine and not in the transactional storage engine of DB2

DB2 z/OS

CALLprocedure

executeDB2

procedure

Procedure Module

access tables(SELECT,CREATE,INSERT,UPDATE,DELETE)

Query Engine IDAA

Netezza

possiblyoffload queryClient

Application

Figure 2: Stored Procedure Execution in DB2

for z/OS itself. Contrary to that, federated systems pro-vide a means to access tables and data residing in another,stand-alone database system. The integration of the differ-ent query engines is on a much deeper level in IDAA.

MySQL provides an internal interface to plugin differentstorage engines with different characteristics [6]. IDAA inDB2 for z/OS implements a similar architecture by combin-ing DB2’s and Netezza’s characteristics and the respectiveunderlying storage mechanisms. However, the integration ofboth happens on the SQL dialect and SQL optimizer layerand not on the buffer pool and storage layers.

5. TRADEMARKSIBM, DB2, and z/OS are trademarks of International Busi-

ness Machines Corporation in USA and/or other countries.Other company, product or service names may be trade-marks, or service marks of others. All trademarks are copy-right of their respective owners.

6. REFERENCES[1] P. Bruni et al. Reliability and Performance with IBM

DB2 Analytics Accelerator V4.1. IBM Redbooks, 2014.

[2] IBM. DB2 Analytics Accelerator Loader for z/OS, 2014.http://www-03.ibm.com/software/products/en/db2-analytics-accelerator-loader-for-zos.

[3] IBM. IBM Netezza Analytics – In-Database AnalyticsDeveloper’s Guide, Release 3.0.1, 2014.

[4] IBM. SPSS Software, 2014.http://www.ibm.com/software/analytics/spss/.

[5] ISO/ IEC 9075-9:2003. Information Technology –Database Languages – SQL – Part 9: Management ofExternal Data (SQL/ MED), 2nd edition, 2003.

[6] A. Lentz. MySQL Storage Engine Architecture. MySQLDeveloper Articles, MySQL AB, May, 2004.

DB2 z/OS

CALLNetezza

procedure

access tables(SELECT,CREATE,INSERT,UPDATE,DELETE)

Query Engine IDAA

Netezza

Procedure Module

delegate CALL

CALLprocedure

Client Application

Figure 3: Stored Procedure Execution in Netezza

707