Quarry: Digging Up the Gems of Your Data Treasury page 1
Quarry: Digging Up the Gems of Your Data Treasury page 2
Quarry: Digging Up the Gems of Your Data Treasury page 3
Quarry: Digging Up the Gems of Your Data Treasury page 4

Quarry: Digging Up the Gems of Your Data Treasury

  • View
    212

  • Download
    0

Embed Size (px)

Text of Quarry: Digging Up the Gems of Your Data Treasury

  • Quarry: Digging Up the Gems of Your Data Treasury

    Petar JovanovicUniversitat Politcnica deCatalunya, BarcelonaTech

    Barcelona, Spainpetar@essi.upc.edu

    Oscar RomeroUniversitat Politcnica deCatalunya, BarcelonaTech

    Barcelona, Spainoromero@essi.upc.edu

    Alkis SimitsisHP Labs

    Palo Alto, CA, USAalkis@hp.com

    Alberto AbellUniversitat Politcnica deCatalunya, BarcelonaTech

    Barcelona, Spainaabello@essi.upc.edu

    Hctor CandnUniversitat Politcnica deCatalunya, BarcelonaTech

    Barcelona, Spainhector.candon@est.fib.upc.edu

    Sergi NadalUniversitat Politcnica deCatalunya, BarcelonaTech

    Barcelona, Spainsnadal@essi.upc.edu

    ABSTRACTThe design lifecycle of a data warehousing (DW) system isprimarily led by requirements of its end-users and the com-plexity of underlying data sources. The process of design-ing a multidimensional (MD) schema and back-end extract-transform-load (ETL) processes, is a long-term and mostlymanual task. As enterprises shift to more real-time and on-the-fly decision making, business intelligence (BI) systemsrequire automated means for eciently adapting a physicalDW design to frequent changes of business needs. To ad-dress this problem, we present Quarry, an end-to-end sys-tem for assisting users of various technical skills in manag-ing the incremental design and deployment of MD schemataand ETL processes. Quarry automates the physical designof a DW system from high-level information requirements.Moreover, Quarry provides tools for eciently accommodat-ing MD schema and ETL process designs to new or changedinformation needs of its end-users. Finally, Quarry facili-tates the deployment of the generated DW design over anextensible list of execution engines. On-site, we will use avariety of examples to show how Quarry facilitates the com-plexity of the DW design lifecycle.

    1. INTRODUCTIONTraditionally, the process of designing a multidimensional

    (MD) schema and back-end extract-transform-load (ETL)flows, is a long-term and mostly manual task. It usuallyincludes several rounds of collecting requirements from end-users, reconciliation, and redesigning until the business needsare finally satisfied. Moreover, in todays BI systems, de-ployed DW systems, satisfying the current set of require-ments is subject to frequent changes as the business evolves.MD schema and ETL process, as other software artifacts, donot lend themselves nicely to evolution events and in general,

    c2015, Copyright is with the authors. Published in Proc. 18th Inter-national Conference on Extending Database Technology (EDBT), March23-27, 2015, Brussels, Belgium: ISBN 978-3-89318-067-7, on OpenPro-ceedings.org. Distribution of this paper is permitted under the terms of theCreative Commons license CC-by-nc-nd 4.0

    maintaining them manually is hard. First, for each new,changed, or removed requirement, an updated DW designmust go through a series of validation processes to guaran-tee the satisfaction of the current set of requirements, andthe soundness of the updated design solutions (i.e., meetingMD integrity constraints [9]). Moreover, the proposed de-sign solutions should be further optimized to meet dierentquality objectives (e.g., performance, fault tolerance, struc-tural complexity). Lastly, complex BI systems may usuallyinvolve a plethora of execution platforms, each one special-ized for eciently performing a specific analytical process-ing. Thus the ecient deployment over dierent executionsystems is an additional challenge.

    Translating information requirements into MD schema andETL process designs has been already studied, and variousworks propose either manual (e.g., [8]), guided (e.g., [1]) orautomated [2, 10, 11] approaches for the design of a DWsystem. In addition, in [4] a tool (a.k.a. Clio) is proposed toautomatically generate correspondences (i.e., schema map-pings) among dierent existing schemas, while another tool(a.k.a. Orchid) [3] further provides interoperability betweenClio and procedural ETL tools. However, Clio and Orchiddo not tackle the problem of creating a target schema. More-over, none of these approaches have dealt with automatingthe adaptation of a DW design to new information needs ofits end-users, or the complete lifecyle of a DW design.

    To address these problems, we built Quarry, an end-to-end system for assisting users in managing the complexityof the DW design lifecycle.

    Quarry starts from high-level information requirementsexpressed in terms of analytical queries that follow the well-known MD model. That is, having a subject of analysis andits analysis dimensions (e.g., Analyze the revenue from thelast years sales, per products that are ordered from Spain.).Quarry provides a graphical assistance tool for guiding non-expert users in defining such requirements using a domain-specific vocabulary. Moreover, Quarry automates the pro-cess of validating each requirement with regard to the MDintegrity constraints and its translation into MD schema andETL process designs (i.e., partial designs).

    Independently of the way end-users translate their infor-mation requirements into the corresponding partial designs,Quarry provides automated means for integrating these MDschema and ETL process designs into a unified DW designsatisfying all requirements met so far.

    549 10.5441/002/edbt.2015.55

    http://OpenProceedings.org/http://dx.doi.org/10.5441/002/edbt.2015.55

  • Figure 1: Quarry: system overview

    Quarry automates the complex and time-consuming taskof the incremental DW design. Moreover, while integrat-ing partial designs, Quarry provides an automatic valida-tion, both regarding the soundness (e.g., meeting MD in-tegrity constraints) and the satisfiability of the current busi-ness needs. Finally, for leading the automatic integration ofMD schema and ETL process designs, and creating an opti-mal DW design solution, Quarry accounts for user-specifiedquality factors (e.g., structural design complexity of an MDschema, overall execution time of an ETL process).

    Since Quarry assists both MD schema and ETL processdesigns, it also eciently supports the additional iterativeoptimization steps of the complete DW design. For example,more complex ETL flows may be required to reduce thecomplexity of an MD schema and improve the performanceof OLAP queries by pre-aggregating and joining source data.

    Besides eciently supporting the traditional DW design,the automation that Quarry provides, largely suits the needsof modern BI systems requiring rapid accommodation of adesign to satisfy frequent changes.

    Outline. We first provide an overview of Quarry andthen, we present its core features to be demonstrated. Lastly,we outline our on-site presentation.

    2. DEMONSTRABLE FEATURESQuarry presents an end-to-end system for managing the

    DW design lifecycle. Thus, it comprises four main compo-nents (see Figure 1): Requirements Elicitor, RequirementsInterpreter, Design Integrator, and Design Deployer.

    For supporting non-expert users in providing their infor-mation requirements at input, Quarry provides a graphicalcomponent, namely Requirements Elicitor (see Figure 2).Requirements Elicitor then connects to a component (i.e.,Requirements Interpreter), which for each information re-quirement at input semi-automatically generates validatedMD schema and ETL process designs (i.e., partial designs).Quarry further oers a component (i.e., Design Integra-tor) comprising two modules for integrating partial MD sch-ema and ETL process designs processed so far, and gener-ating unified design solutions satisfying a complete set ofrequirements. At each step, after integrating partial designsof a new requirement, Quarry guarantees the soundness ofthe unified design solutions and the satisfiability of all re-

    Figure 2: Requirements Elicitor

    quirements processed so far. The produced DW design solu-tions are further sent to the Design Deployer component forthe initial deployment of a DW schema and an ETL processthat populates it. The deployed design solutions are thenavailable for further user-preferred tunings and use.

    To support intra and cross-platform communication, Qua-rry uses the communication & metadata layer (see Figure 1).

    2.1 Requirements ElicitorRequirements Elicitor uses a graphical representation of a

    domain ontology capturing the underlying data sources. Adomain ontology can be additionally enriched with the busi-ness level vocabulary, to enable non-expert users to expresstheir analytical needs. Notice for example a graphical repre-sentation of an ontology capturing the TPC-H1 data sourcesin top-left part of Figure 2. Apart from manually definingrequirements from scratch, Requirements Elicitor also oersassistance to end-users data exploration tasks by analyz-ing the relationships in the domain ontology, and automat-ically suggesting potentially interesting analytical perspec-tives. For example, a user may choose the focus of an anal-

    1http://www.tpc.org/tpch/

    550

  • Deployable design solutions

    fact_table_revenue ...

    fact_table_revenue ...

    fact_table_revenue ...

    ... DATASTORE_Partsupp

    ... DATASTORE_Partsupp

    ... DATASTORE_Partsupp...

    Partial designs

    fact_table_revenue ...

    fact_table_revenue ...

    fact_table_netprofit ...

    ... DATASTORE_Partsupp

    ... DATASTORE_Partsupp

    ... DATASTORE_Nation...

    ...

    ...

    IR 1

    IR N

    IR 1

    IR N

    MD schema (xMD)

    ETL process (xLM)

    Unified design solutions (IR 1 IR N)

    fact_table_revenue ...

    fact_table_revenue ...

    fact_table_revenue