The iBench Integration Metadata Generator

Patricia C. Arocena, University of Toronto, [email protected]

Boris Glavic, Illinois Institute of Technology, [email protected]

Radu Ciucanu, University of Oxford, [email protected]

Renée J. Miller, University of Toronto, [email protected]

ABSTRACT

Given the maturity of the data integration field, it is surprising that rigorous empirical evaluations of research ideas are so scarce. We identify a major roadblock for empirical work: the lack of comprehensive metadata generators that can be used to create benchmarks for different integration tasks. This makes it difficult to compare integration solutions, understand their generality, and understand their performance. We present iBench, the first metadata generator that can be used to evaluate a wide range of integration tasks (data exchange, mapping creation, mapping composition, and schema evolution, among many others). iBench permits control over the size and characteristics of the metadata it generates (schemas, constraints, and mappings). Our evaluation demonstrates that iBench can efficiently generate very large, complex, yet realistic scenarios with different characteristics. We also present an evaluation of three mapping creation systems using iBench and show that the intricate control that iBench provides over metadata scenarios can reveal new and important empirical insights. iBench is an open-source, extensible tool that we are providing to the community. We believe it will raise the bar for empirical evaluation and comparison of data integration systems.

1. INTRODUCTION

Despite the large body of work in data integration, the typical evaluation of an integration system consists of experiments over a few real-world scenarios (e.g., the Amalgam Integration Test Suite [18], the Illinois Semantic Integration Archive, or Thalia [13]) shared by the community, or ad hoc synthetic schemas and data sets that are created for a specific evaluation. Usually, the focus of such an evaluation is to exercise and showcase the novel features of an approach. It is often hard to reuse these evaluations.

Patterson [19] states that "when a field has good benchmarks, we settle debates and the field makes rapid progress." Our canonical database benchmarks, the TPC benchmarks, make use of a carefully designed schema (or, in the case of TPC-DI, a data integration benchmark, a fixed set of source and destination schemas) and rely on powerful data generators that can create data with different sizes, distributions, and characteristics to test the performance of a DBMS. For data integration, however, data generators must be accompanied by metadata generators to create the diverse metadata (schemas, constraints, and mappings) that fuel integration tasks. To reveal the power and limitations of integration solutions, this metadata must be created while systematically controlling its detailed characteristics, complexity, and size. Currently, there are no openly-available tools for systematically generating metadata for a variety of integration tasks.

1.1 Integration Scenarios

An integration scenario is a pair of a source and target schema (each optionally containing a set of constraints) and a mapping between the schemas. Integration scenarios are the main abstraction used to evaluate integration systems. Typically in evaluations, the relationship between the source and the target schema is controlled to illustrate different features of a system. We present the two primary ways such scenarios have been generated and illustrate how they have been used in evaluating different integration tasks.

Example 1.1 (Primitives). Libraries of primitives can be used to create scenarios of different shapes and sizes. A set of synthetic schema evolution scenarios modeling micro and macro evolution behavior was proposed for evaluating mapping adaptation [24]. A set of finer-grained schema evolution primitives was proposed for testing mapping composition [7]. Each primitive is a simple integration scenario. Similarly, a set of mapping primitives was proposed in STBenchmark to permit the testing and comparison of mapping creation systems [2]. The primitives used in each approach depended on the integration task. In STBenchmark, the task was mapping creation or discovery. For composition [7], some additional primitives that were trivial for mapping creation (like an add-attribute primitive) were used because they can make the composition non-trivial (i.e., the composition may be a second-order (SO) tgd [12]).

Using mapping primitives, one can either test a single solution or compare multiple solutions on integration scenarios with different properties (by selecting different subsets of primitives). In addition, scalability can be tested by combining primitives and applying them repeatedly to generate larger scenarios (larger schemas and mappings). As an alternative to using mapping primitives to create integration scenarios, one can also use ad hoc integration scenarios tailored to test specific features of an integration method. This approach was the norm before the primitive approach was introduced, and remains common.

Example 1.2 (Ad Hoc Scenarios). MapMerge [1] is a system for correlating mappings using overlapping sets of target relations, or using relations that are associated via target constraints. To evaluate their approach, the authors used three real biological schemas (Gene Ontology, UniProt, and BioWarehouse) and mapped the first two schemas to the third (BioWarehouse). These real scenarios reveal the applicability of their technique in practice, but do not allow control over the metadata characteristics (number of mappings, their complexity, etc.). Unfortunately, the primitives of previous approaches did not meet their needs, as the mappings created by sets of primitives rarely shared target relations and the scenario generators did not permit detailed control over how target constraints were generated.

Hence, to evaluate MapMerge, the authors created their own ad hoc integration scenarios. These were designed to let the researchers control the degree of sharing (how many mappings used the same target relation) and the set of target schema constraints, the two crucial parameters for the MapMerge algorithm. This set-up permitted control over the mapping scenario complexity, which was defined as the number of hierarchies in the target schema. However, this definition of complexity is not appropriate for other tasks (e.g., mapping composition) that do not depend so directly on the characteristics of a schema hierarchy. Furthermore, this scenario generator is limited to this one specific schema shape. These characteristics make it unlikely that others can benefit from this test harness for evaluating their work.

The widespread use of ad hoc scenarios is necessitated by the lack of a common, shared metadata generator.

1.2 Contributions

We present iBench, a flexible integration metadata generator, and make the following contributions.

• We define the metadata-generation problem and detail important design decisions for a flexible, easy-to-use solution to this problem (Section 2). Our problem definition includes the generation of (arbitrary) independent schemas with an (arbitrary) set of mappings between them, where the independent variables controlled by a user include the schema size, the number of mappings (mapping size), the mapping complexity, and the mapping language. The problem definition also permits the generation of schemas where the relationship between the schemas is controlled in a precise way. Hence, this relationship can be viewed as an independent variable, controlled by a user of the generator through the use of integration primitives.

• We show that the metadata-generation problem is NP-hard and that even determining if there is a solution to a specific generation problem may be NP-hard.

• We present MDGen, a best-effort, randomized algorithm (Section 3) that solves the metadata-generation problem. The main innovation behind MDGen is in permitting full control over all independent variables describing the metadata in a fast, scalable way.

• We present an evaluation of the performance of iBench (Section 4) and show that iBench can efficiently generate both large integration scenarios and large numbers of scenarios. We show that iBench can easily generate scenarios with over 1 million attributes, sharing among mappings, and complex schema constraints in seconds. iBench can be easily extended with new (user-defined) primitives and new integration scenarios to adapt it to new applications and integration tasks. We present a performance evaluation of this feature where we take several real-world scenarios, scale them up by a factor of more than 1,000, and combine these user-defined primitives with native iBench primitives. We show that this scaling is in most cases linear.

• We demonstrate the power of iBench by presenting a novel evaluation of three mapping discovery algorithms, Clio [10], MapMerge [1], and ++Spicy [15] (Section 5). Our evaluation systematically varies the degree of source and target sharing in the generated scenarios. This reveals new insights into the power of (and need for) mapping correlation and core mappings on complex scenarios. As the first generator that varies the amount of generated constraints, we show for the first time how this important parameter influences mapping correlation. We also quantify how increasing target equality-generating dependencies (egds) improves the quality of ++Spicy's mappings.

2. METADATA-GENERATION PROBLEM

We now motivate our design decisions and formalize the metadata-generation problem. We use a series of examples which showcase the need for certain features. This culminates in a formal problem statement and a complexity analysis which demonstrates that the problem is NP-hard.

2.1 Generating Integration Scenarios

Example 2.1. Alberto has designed a new mapping inversion algorithm called mapping restoration. Like the Fagin inverse [9], not every mapping has a restoration, but Alberto believes that most mappings do. To demonstrate this empirically, he would like to compare the success rates of implementations for restoration and the Fagin inverse. To do this, he must be able to generate large sets of mappings.

Alberto's problem would be solved by a mapping generator. We define an integration scenario as a triple M = (S, T, Σ), where S and T are schemas and Σ is a set of logical formulas over (S, T). Relation (and attribute) names in the schemas may overlap, but we assume we can identify each uniquely (for example, using S.R.A and T.R.A). We generalize this definition later (Definition 2.14) to permit constraints (e.g., keys, foreign keys), instances of the schemas, and transformations implementing the mappings.

To fulfill Alberto's requirements, a metadata generator is needed that can efficiently generate integration scenarios. One approach would be to generate schemas and constraints and apply an existing mapping discovery algorithm to generate mappings for these schemas. However, this can be quite expensive and the output would be biased by the idiosyncrasies of the algorithm. We take a different approach motivated by data generation and permit a user to control the basic metadata characteristics. Ideally, Alberto should be able to specify the values (or admissible intervals of values) for the characteristics he would like to control as an input to the generator. The metadata generator then has to create an integration scenario (or many scenarios) that fulfills these restrictions. This is only possible if the restricted characteristics can be measured efficiently, e.g., the number of source relations. To define this more precisely, we will assume relational schemas and mappings specified as source-to-target tuple-generating dependencies, arguably the most commonly used mapping language in integration [23].


Parameter          | Description
πSourceRelSize     | Number of attributes per source relation
πTargetRelSize     | Number of attributes per target relation
πSourceSchemaSize  | Number of relations in source schema
πTargetSchemaSize  | Number of relations in target schema
πNumMaps           | Number of mappings
πSourceMapSize     | Number of source atoms per mapping
πTargetMapSize     | Number of target atoms per mapping
πMapJoinSize       | Number of attributes shared by two atoms
πNumSkolem         | % of target variables not used in source
πSkolemType        | Type of value invention (all, random, key)
πSkolemSharing     | % of value-invention functions shared among mappings
πSourceSharing     | % of mappings that share source relations
πTargetSharing     | % of mappings that share target relations

Table 1: Scenario Parameters

Parameter                                | Min | Max
πSourceRelSize and πTargetRelSize        | 2   | 5
πSourceSchemaSize and πTargetSchemaSize  | 2   | 10
πNumMaps                                 | 1   | 5
πSourceMapSize                           | 1   | 1
πTargetMapSize                           | 1   | 10
πNumSkolem                               | 0%  | 50%

Table 2: MD task: independent schemas & LAV mappings

A source-to-target tuple-generating dependency [11] (s-t tgd) is a formula of the form ∀x(φ(x) → ∃y ψ(x,y)), where x and y are disjoint vectors of variables; φ(x) and ψ(x,y) are conjunctions of atomic formulas over the source schema S and target schema T, respectively. All variables of x are used in φ and all variables of y are used in ψ. S-t tgds are sometimes called global-and-local-as-view (GLAV) mappings [14]. Local-as-view (LAV) mappings have a single atom in φ, whereas global-as-view (GAV) mappings have a single atom in ψ and no existential variables (y is empty).
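To make these mapping-language classes concrete, the following sketch (illustrative only; the class and function names are not part of iBench) represents an s-t tgd as source atoms, target atoms, and existential variables, and classifies it as LAV, GAV, or GLAV according to the definitions above.

```python
from dataclasses import dataclass, field

@dataclass
class Atom:
    relation: str        # e.g., "Cust" or "Customer"
    arguments: list      # variable names, e.g., ["a", "b"]

@dataclass
class STTgd:
    source_atoms: list   # conjunction phi(x) over the source schema
    target_atoms: list   # conjunction psi(x, y) over the target schema
    existential_vars: set = field(default_factory=set)  # the vector y

def mapping_class(m: STTgd) -> str:
    """Classify an s-t tgd as LAV, GAV, or general GLAV."""
    if len(m.source_atoms) == 1:
        return "LAV"                                # single atom in phi
    if len(m.target_atoms) == 1 and not m.existential_vars:
        return "GAV"                                # single atom in psi, y empty
    return "GLAV"

# Mapping m1 of Example 2.3 below: Cust(a, b) -> exists c. Customer(c, a, b)
m1 = STTgd(source_atoms=[Atom("Cust", ["a", "b"])],
           target_atoms=[Atom("Customer", ["c", "a", "b"])],
           existential_vars={"c"})
print(mapping_class(m1))  # LAV
```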

A major choice in the design of a metadata generator is what scenario characteristics should be controllable by the user. On the one hand, we would like to give the user full control over the generation; on the other hand, we do not want to overwhelm her with a plethora of rarely used options. Clearly, the number of mappings generated and the size of these mappings (determined as the number of conjuncts) should be controllable. A user may also want to control the incompleteness in a mapping, i.e., the number of existential variables used in the target expression (‖y‖). Using this intuition, one can construct a generator that creates two independent schemas of a specified size (number of relations and attributes per relation) and mappings (of a specified size) between the schemas. To model this, we use scenario parameters in our problem definition (Table 1) and allow a user to specify a minimum and maximum value for each. We will explain the last few later in this section.

Definition 2.2 (MD Task). Let Π be a set of scenario parameters (shown in Table 1). A metadata-generation task Γ is a set of constraints of the form minπ ≤ π ≤ maxπ, where π ∈ Π. For a given integration scenario M, we say that M is a solution for Γ if M |= Γ.

Note that we do not require that every scenario charac-teristic is constrained by an MD task. This enables a userto only vary the characteristics she wants to control.

Example 2.3. To evaluate his hypothesis that restoration is more successful for LAV than for GLAV mappings, Alberto creates the MD task in Table 2 for LAV mappings (and GLAV mappings by increasing parameter πSourceMapSize) and runs his algorithm over the generated mappings. A scenario that is a solution for the task in Table 2 is shown in Figure 1. The red lines represent correspondences (mapping variables shared between the source and target); ignore the black lines for now. The mappings for this scenario are shown below.

m1 : ∀a, b Cust[a, b] → ∃c Customer[c, a, b]
m2 : ∀a, b Emp[a, b] → ∃c Person[c, b]

To see how the compliance of the scenario with the constraints of the task is evaluated, consider πSourceRelSize. The task requires that 2 ≤ πSourceRelSize ≤ 5. Since both relations Cust and Emp have 2 attributes, the scenario fulfills this constraint. As another example, the task specifies that up to 50% of the target variables may be existential (Skolems). Both m1 and m2 fulfill this constraint (1/3 and 1/2, respectively).
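As a concrete illustration of how such a compliance check can be expressed, the following sketch (illustrative only; not iBench's implementation) measures a few scenario characteristics and tests them against the inclusive min/max bounds of an MD task.

```python
def satisfies_task(stats: dict, task: dict) -> bool:
    """stats maps a parameter name to its measured value(s);
    task maps a parameter name to an inclusive (min, max) interval."""
    for param, (lo, hi) in task.items():
        values = stats.get(param)
        if values is None:
            continue  # unconstrained or unmeasured characteristic
        # a characteristic may be measured per relation or per mapping, so check all values
        if any(not (lo <= v <= hi) for v in values):
            return False
    return True

# The scenario of Example 2.3: Cust and Emp have 2 attributes each;
# m1 and m2 use 1/3 and 1/2 existential target variables.
stats = {"SourceRelSize": [2, 2], "NumSkolem": [1/3, 1/2]}
task = {"SourceRelSize": (2, 5), "NumSkolem": (0.0, 0.5)}
print(satisfies_task(stats, task))  # True
```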

2.2 Controlling Incompleteness

Example 2.4. Alberto has extended his system to support second-order (SO) tgds [12], a broader class of mappings with finer control over incompleteness. He needs a generator that can create SO tgd mappings and control the incompleteness. To create SO tgds that are not equivalent to any s-t tgd, he would like to control how the arguments of Skolem functions representing unknown values are chosen and how Skolem functions are reused among mappings.

While s-t tgds permit incompleteness via existential variables, this incompleteness can also be represented via Skolem functions. Indeed, because of this, we called the scenario parameter controlling this πNumSkolem. The mapping m1 from Example 2.3 is equivalent to the following SO tgd.

m1 : ∃f ∀a, b Cust[a, b]→ Customer[f(a, b), a, b]

Using this representation, some limitations of s-t tgds become clear. First, all functions depend on all universally quantified variables in the mapping. Second, two mappings cannot share the same value-invention (Skolem) function (i.e., function f cannot be used in both m1 and m2). We define the metadata-generation problem to include scenarios that use plain SO tgds [4], a strict superset of s-t tgds with more flexible value-invention semantics. Arenas et al. [4] argue that plain SO tgds are the right choice for important integration problems related to composition and inversion.

A plain SO tgd [4] is an existential second-order formula of the form ∃f( (∀x1(φ1 → ψ1)) ∧ · · · ∧ (∀xn(φn → ψn)) ) where (a) each φi is a conjunction of relational source atoms over variables from xi such that every variable from xi appears in at least one atom, and (b) each ψi is a conjunction of atomic formulas that only uses terms of the form T(y1, ..., yk) where T ∈ T and each yi is either a variable from xi or a term f(x1, ..., xk) where f is a function symbol in f and each xi ∈ x. We refer to each clause of a plain SO tgd, ∀xi(φi → ψi), as a mapping to keep the analogy with s-t tgds.

To control the complexity of SO tgd generation, we use additional parameters. The πSkolemSharing parameter controls the percentage of Skolems (from f) shared between mappings. The πSkolemType parameter controls how the functions are parameterized: for value All, each function depends on all universal variables in the mapping; for Random, each function depends on a random subset of the universal variables; and for Keys, each function depends on the primary keys of the mapping's source relations. Setting πSkolemSharing = 0% and πSkolemType = All limits the mappings to s-t tgds.

Example 2.5. By modifying the parameter πSkolemType in Table 2 to Random and πSkolemSharing to a max of 100%, the following two mappings may be created as a solution.

∃f (∀a, b Cust[a, b] → Customer[f(a), a, b]
    ∧ ∀a, b Emp[a, b] → Person[f(b), b])
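The following sketch illustrates (with illustrative, non-iBench names) how the value of πSkolemType can determine the arguments of a freshly invented Skolem function for one mapping; `random.sample` supplies the random subset used by the Random option.

```python
import random

def skolem_args(source_attrs, key_attrs, skolem_type, rng=random):
    """Choose the arguments of a new Skolem function for one mapping.

    source_attrs: all universally quantified variables of the mapping
    key_attrs:    the primary-key attributes of the mapping's source relations
    skolem_type:  one of "All", "Random", "Keys" (the values of pi_SkolemType)
    """
    if skolem_type == "All":
        return list(source_attrs)              # depend on every universal variable
    if skolem_type == "Random":
        k = rng.randint(1, len(source_attrs))  # non-empty random subset
        return rng.sample(list(source_attrs), k)
    if skolem_type == "Keys":
        return list(key_attrs)                 # depend only on the source keys
    raise ValueError(f"unknown Skolem type: {skolem_type}")

# Example 2.5: with Random, Customer's Skolem may end up depending only on a
print(skolem_args(["a", "b"], ["a"], "Random"))
```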


In addition to supporting scenario parameters that control schema and mapping size, we let a user limit the mapping language to the important classes of s-t tgds: LAV (by restricting the number of source atoms per mapping to 1 using parameter πSourceMapSize) and GAV (by restricting the number of target atoms per mapping with πTargetMapSize = 1 and πNumSkolem = 0%), classes that have been identified as important in a survey of mapping languages [23]. Many integration systems do not support all languages, and even when multiple mapping languages are supported, a system may provide different performance depending on the language.
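For instance, a task restricted to GAV or LAV mappings could be expressed with the following (purely illustrative) parameter bounds, following the restrictions named above:

```python
# Bounds that confine generated mappings to GAV, per the discussion above:
gav_task = {
    "TargetMapSize": (1, 1),  # single target atom per mapping
    "NumSkolem": (0, 0),      # no existential (Skolem) target variables
}
# LAV instead restricts the number of source atoms per mapping:
lav_task = {"SourceMapSize": (1, 1)}
```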

2.3 Schema Primitives

Example 2.6. Alice has built a system that creates executable data exchange transformations from s-t tgds. To thoroughly test her approach, she needs a large number of diverse scenarios (schemas and mappings). She decides to follow Alberto's methodology to generate large sets of mappings. For each, she creates a transformation script and runs it over source databases (created by a data generator like ToXGene [6]). While she was able to test her system's performance over a wide variety of metadata with little effort, she feels that the performance of her system on independent schemas with random mappings is not necessarily predictive of its performance for real-world settings. Similar to how database benchmarks such as TPC-H provide realistic queries and data, she would like to create scenarios that give her a better understanding of how her system will perform on real mappings that are likely to arise in practice.

Like Alice in the example, current evaluations do not test integration systems on arbitrary scenarios; rather, they control quite precisely the relationship between the source and target schema. This has often been modeled using mapping primitives. A mapping primitive is a parameterized integration scenario that represents a common pattern. For instance, vertical and horizontal partitioning are common patterns that a metadata generator should support. Primitives act as templates that are instantiated based on user input. A metadata generator should allow a user to determine what primitives to use when generating an integration scenario. By doing so, the user can control the relationship between the schemas. For example, she can use only primitives that implement common transformations used in ontologies (e.g., that increase or decrease the level of generalization in an isa hierarchy) or she can choose to use only those primitives that implement common relational schema evolution transformations. The two existing mapping generators follow this approach. The first uses the schema evolution primitives proposed by Yu and Popa [24] to test a mapping adaptation system, extended by Bernstein et al. [7] to test a mapping composition system. The second, STBenchmark, uses a set of transformation primitives for testing mapping creation and data exchange systems [2].

Example 2.7. Consider the integration scenario in Figure 1 that could be created using two primitives. The first, copy-and-add-attribute (ADD), creates the source relation Cust(Name, Addr) and copies it to a target relation that contains another attribute, Customer(Name, Addr, Loyalty). The new attribute Loyalty does not correspond to any source attribute. The second primitive, vertical-partitioning-hasA (VH), creates a source relation, Emp(Name, Company), and vertically partitions it into Person(Id, Name) and WorksAt(EmpRec, Firm, Id). The VH primitive creates a has-a relationship between the two target relations (modeled by a foreign key (FK)). Specifically, the primitive VH creates new keys for the two target relations (Id for Person and EmpRec for WorksAt) and declares WorksAt.Id to be a FK for Person.

A generator should support a comprehensive set of primitives described in the literature and should be extensible to incorporate new, user-defined primitives (see Section 2.5). Furthermore, a generator should avoid a proliferation of primitives. To achieve this, primitives should be parameterized. Primitives should also work seamlessly with scenario parameters: the generator should create scenarios that include mappings (relations and constraints) created by primitives and independent mappings (relations and constraints) whose generation is controlled by the scenario parameters.

Comprehensive Set of Primitives. A partial list of primitives is given in Figure 4 (the complete list of 17 primitives is shown in Appendix A.1). We define the metadata-generation problem to use primitives including the mapping primitives of STBenchmark [2] (e.g., VP) and the schema evolution primitives [7, 24] (e.g., ADD). We also include as primitives ad hoc scenarios that we found in evaluations in the literature. For example, VA (see Example 1.2) is the schema hierarchy scenario used to evaluate MapMerge [1], and VH and VNM implement versions of the chain-query LAV views and star-query LAV views used to evaluate the MiniCon data integration (answering queries using views) system [21]. In the MD task, a user specifies the number of times each primitive should be instantiated via a parameter θP (with default value 0), where P is the name of the primitive.

Parameterized Primitives. As done in STBenchmark, primitives can be parameterized to change features like the number of relations created. This permits, as a simple example, a single parameter λNumPartitions (we will use λ for primitive parameters and θ for setting the number of instances of a primitive) to be used to vertically partition a relation into two, three, or more relations.
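As an illustration of what instantiating such a parameterized primitive might produce, the sketch below (illustrative names only, not iBench's implementation) expands a VP template for a given λNumPartitions into one source relation, that many target fragments, and an s-t tgd string in the style of Figure 4.

```python
def instantiate_vp(source_name: str, attrs: list, num_partitions: int):
    """Vertically partition `source_name(attrs)` into `num_partitions` target fragments.
    A shared Skolem term f(all source attrs) plays the role of the invented key."""
    assert num_partitions <= len(attrs), "need at least one attribute per fragment"
    key = f"f({', '.join(attrs)})"
    # deal attributes round-robin into the fragments
    fragments = [attrs[i::num_partitions] for i in range(num_partitions)]
    targets = [(f"{source_name}_P{i+1}", frag) for i, frag in enumerate(fragments)]
    lhs = f"{source_name}({', '.join(attrs)})"
    rhs = " AND ".join(f"{name}({key}, {', '.join(frag)})" for name, frag in targets)
    return targets, f"{lhs} -> {rhs}"

targets, mapping = instantiate_vp("R", ["a", "b", "c"], 3)
print(mapping)
# R(a, b, c) -> R_P1(f(a, b, c), a) AND R_P2(f(a, b, c), b) AND R_P3(f(a, b, c), c)
```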

Example 2.8. Alice has created the task in Table 3 (left) to create LAV mappings. Notice she has not set the source (or target) schema size parameter or the number-of-mappings parameter. Hence, mappings will only be created by invoking primitives. This configuration will create n source relations and n mappings for 40 ≤ n ≤ 100 (depending on how many primitives are instantiated). The number of target relations depends on the number of partitions chosen for each primitive instantiation (these primitives use the λNumPartitions parameter, so each primitive invocation will create between 2 and 10 target relations). Alice finds that her transformations perform well on the hundreds of scenarios she creates using this configuration file. She now wants to test how quickly performance deteriorates as the scenarios become less well-behaved, with more schemas and mappings being independent. She now uses the configuration file in Table 3 on the right and creates 200 scenarios. These scenarios will all have exactly 100 primitive invocations, but in addition will have between 0 and 100 arbitrary mappings (and relations). She can then plot her performance vs. the percentage of independent mappings (0 to 50% independent mappings).

Primitives and Scenario Parameters. To satisfy Alice's needs, we define the metadata-generation problem in a more flexible fashion than previous metadata generators (e.g., STBenchmark), where a scenario is generated as the union of a number of primitive instantiations.


Figure 1: Random Mapping, ADD and VH Primitives (source: Cust(Name, Addr), Emp(Name, Company); target: Customer(Name, Addr, Loyalty), Person(Id, Name), WorksAt(EmpRec, Firm, Id))

Figure 2: ADL with Target Sharing (source: Cust(Name, Addr), Emp(Name, Company), Executive(Name, Position); target: Customer(Name, Addr, Loyalty), Person(Id, Name), WorksAt(EmpRec, Firm, Id))

Figure 3: A Pivot UDP (source: Sales(Year, 1QT, 2QT, 3QT, 4QT); target: Perf(Year, Quarter, Amount))

Name | Description                                               | Example                                             | FKs
ADD  | Copy a relation and add new attributes                    | R(a, b) → S(a, b, f(a, b))                          | No
ADL  | Copy a relation, add and delete attributes in tandem      | R(a, b) → S(a, f(a))                                | No
CP   | Copy a relation                                           | R(a, b) → S(a, b)                                   | No
HP   | Horizontally partition a relation into multiple relations | R(a, b) ∧ a = c1 → S1(b); R(a, b) ∧ a = c2 → S2(b)  | No
SU   | Copy a relation and create a surrogate key                | R(a, b) → S(f(a, b), b, g(b))                       | No
VH   | Vertical partitioning into a HAS-A relationship           | R(a, b) → S(f(a), a) ∧ T(g(a, b), b, f(a))          | Yes
VI   | Vertical partitioning into an IS-A relationship           | R(a, b, c) → S(a, b) ∧ T(a, c)                      | Yes
VNM  | Vertical partitioning into an N-to-M relationship         | R(a, b) → S1(f(a), a) ∧ M(f(a), g(b)) ∧ S2(g(b), b) | Yes
VP   | Vertical partitioning                                     | R(a, b) → S1(f(a, b), a) ∧ S2(f(a, b), b)           | Yes

Figure 4: Exemplary Native Primitives

Left task:
Parameter        | Min | Max
θVH              | 10  | 25
θVI              | 10  | 25
θVNM             | 10  | 25
θVP              | 10  | 25
λNumPartitions   | 2   | 10
πSourceRelSize   | 5   | 10
πTargetRelSize   | 7   | 12

Right task:
Parameter          | Min | Max
θVH                | 25  | 25
θVI                | 25  | 25
θVNM               | 25  | 25
θVP                | 25  | 25
λNumPartitions     | 3   | 3
πSourceRelSize     | 5   | 10
πTargetRelSize     | 7   | 12
πNumMappings       | 100 | 200
πSourceSchemaSize  | 100 | 200
πTargetSchemaSize  | 300 | 400

Table 3: Example MD Tasks

In our problem definition, the user can request a number of instances for each primitive type and/or use scenario parameters to generate independent metadata.

2.4 Sharing Among Mappings

Example 2.9. Alice notices an anomaly in her results. Even though her transformations remove subsumed (redundant) tuples created by older transformation systems like Clio [20], varying the amount of redundancy in the source instances does not affect performance for schemas created only using primitives. She realizes that no target relation is being populated by more than one mapping, i.e., there is no redundancy created in the target. She has forgotten to set an important iBench parameter controlling the degree to which mappings can share source or target relations.

By combining multiple instances of the primitives, a user can generate diverse integration scenarios with a great deal of control over the scenario characteristics. However, while real-world schemas contain instances of these primitives, they often contain metadata that corresponds to a combination of primitives. To create such realistic scenarios, it is important to permit sharing of metadata among mappings and to control the level of sharing. For example, in a real-world scenario, some source/target relations may typically be used by multiple mappings (e.g., employees and managers from a source schema are both mapped to a person relation in the target schema), but it is uncommon for all mappings to share a single source relation. This type of sharing of metadata has been recognized as an important feature by Alexe et al. [2], but they tackle this issue by providing some primitives that combine the behavior of other, simpler primitives.

Example 2.10. Suppose we apply a third primitive, copy-add-delete-attribute (ADL). Without sharing, this primitive would create a new source and target relation where the latter is a copy of the former, excluding (deleting) some source attribute(s) and creating (adding) some target attribute(s). With sharing, primitives can reuse existing relations. If we enable target sharing, then an application of ADL may create a new source relation Executive and copy it into an existing target relation. In our example (see Figure 2), the existing target Person is chosen (this relation is part of a VH primitive instance). ADL deletes the source attribute Position and adds the target attribute Id. By addition, we mean that no attribute of the source relation Executive is used to populate this target attribute. Notice that the resulting scenario is a very natural one: parts of the source relations Emp and Executive are both mapped to the target Person relation, while other parts (other attributes) of these source relations are partitioned into a separate relation (in the case of Emp) or removed (in the case of Executive).

To create realistic metadata, a generator could create new primitives that represent combinations of the existing primitives (as done in STBenchmark). However, implementing all possible combinations is infeasible in terms of implementation effort. In addition, it is likely overwhelming for a user to choose from long lists of primitives, making such a generator hard to use. We define the metadata-generation problem using an alternative route under which a user is able to control the amount of source or target sharing among invocations of the primitives and among independent mappings.
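A minimal sketch (illustrative only, not iBench's code) of how such a sharing level can be realized as a per-primitive probabilistic choice, mirroring the behavior described for πSourceSharing and πTargetSharing:

```python
import random

def choose_target_relation(existing_targets, target_sharing_pct, make_new_relation, rng=random):
    """With probability target_sharing_pct (in percent), reuse an existing target
    relation created by an earlier primitive instance; otherwise create a fresh one."""
    if existing_targets and rng.random() * 100 < target_sharing_pct:
        return rng.choice(existing_targets)   # share with an earlier primitive instance
    return make_new_relation()

rng = random.Random(0)
targets = ["Person", "WorksAt"]
print(choose_target_relation(targets, 25, lambda: "NewTarget", rng))
```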

2.5 User-Defined Primitives

Example 2.11. Alice wishes to test her solution on mappings that include a pivot. A simple example of this is depicted in Figure 3. There are four separate source attributes representing quarters, each containing the sales amount for that quarter. In the target, information about quarters is represented as data (the Quarter attribute). The desired mapping is mp (omitting the universal quantification).

mp : Sales(Y, 1QT, 2QT, 3QT, 4QT) →
     Perf(Y, '1', 1QT), Perf(Y, '2', 2QT), Perf(Y, '3', 3QT), Perf(Y, '4', 4QT)
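To make the effect of this UDP concrete, here is a minimal sketch of the pivot as a data transformation (plain Python over tuples; the function name and tuple layout are illustrative, not part of iBench).

```python
def pivot_sales(sales_rows):
    """Implement mapping mp: each Sales(Y, 1QT, 2QT, 3QT, 4QT) tuple yields
    four Perf(Year, Quarter, Amount) tuples, turning column names into data."""
    perf_rows = []
    for year, q1, q2, q3, q4 in sales_rows:
        for quarter, amount in zip(("1", "2", "3", "4"), (q1, q2, q3, q4)):
            perf_rows.append((year, quarter, amount))
    return perf_rows

print(pivot_sales([(2014, 10, 20, 30, 40)]))
# [(2014, '1', 10), (2014, '2', 20), (2014, '3', 30), (2014, '4', 40)]
```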

Sharing provides great flexibility in creating realistic and complex scenarios. However, a user may sometimes require additional control. For example, a proposed integration technique may be designed to exploit scenarios with a very specific shape. Despite our attempt to be comprehensive in creating primitives, there may be other shapes of interest. In addition, it is often helpful to use real scenarios (for example, the real schemas currently being used to evaluate integration systems) as building blocks in creating integration scenarios. We do not see metadata generation as a replacement for the use of real schemas, real mappings, and real data; rather, we would like generation to make such real scenarios even more valuable. Hence, we include in the metadata problem definition the use of user-defined primitives (UDPs).

The advantage of supporting UDPs is that we can systematically scale these scenarios (creating new scenarios with many copies of the new primitive) and can combine them with other native primitives or other UDPs. All native primitives and UDPs should be able to share metadata elements.

Example 2.12. By creating a new primitive Pivot, Alice can now combine this primitive with other primitives to create large scenarios with many pivots, and can combine these with vertical or horizontal partitioning, sharing metadata (relations) if desired between the UDP and these native primitives. This permits her to understand the impact of pivoting (accuracy and performance) on her integration task.

In Section 4.2, we will show how some already commonly used real scenarios can be turned into UDPs.

2.6 Generation of Other Metadata

Example 2.13. Patricia has designed a new algorithm to rewrite SO tgds into s-t tgds. Her algorithm uses functional dependencies (FDs) to rewrite SO tgds that could not be rewritten without these additional constraints. The algorithm is sound, but not complete. To test her algorithm, she needs not only to be able to generate SO tgds, but also to vary the amount of source and target constraints. We have defined the metadata-generation problem to include not only mappings, but also constraints, and as a result she was able to use an early prototype of iBench to generate 12.5 million SO tgds and test the success rate of her rewriting algorithm into s-t tgds as the number of constraints changed [5].

Primitives permit the controlled generation of specific constraints. In addition, a metadata generator should be able to generate arbitrary constraints. We define the metadata-generation problem to include FDs (including keys) and INDs (including foreign keys). The user must be able to control the type and complexity of the constraints (e.g., the number of attributes in key constraints or the number of non-key functional dependencies). Hence, the full list of scenario parameters (28 parameters) also includes parameters for this purpose (see Tables 4 and 5). In addition, we have scenario parameters to control things like renaming. By default, primitives (and independent mappings) reuse source names in the target and invent random names (using the name generator of STBenchmark) for target attributes and relations that are not mapped from the source. This can be changed to invoke renaming on some or all mappings. Renaming is implemented as a plug-in; we plan to include more flexible name generators in the future.

In addition, a user may (optionally) request the generation of data (just source, or source and target), and transformation code that implements the generated mapping. We now give the complete definition of an integration scenario.

Definition 2.14. An integration scenario is a tuple M = (S, T, ΣS, ΣT, Σ, I, J, T), where S and T are schemas, ΣS and ΣT are source and target constraints, Σ is a mapping between S and T, I is an instance of S satisfying ΣS, J is an instance of T satisfying ΣT, and T is a program that implements the mapping Σ.

The user may request any subset of scenario metadata types. In addition, she may specify whether J should be a solution for I (meaning (I, J) |= Σ and T(I) = J) or whether J should be an independent instance of T.

2.7 Complexity and Unsatisfiability

We now formally state the metadata-generation problem and investigate its complexity, showing that it is NP-hard in general. Furthermore, we demonstrate that there exist tasks for which there is no solution. Proofs of the theorems can be found in Appendix B. These results have informed our MDGen algorithm presented in Section 3.

Definition 2.15 (Metadata-Generation Problem). The metadata-generation problem is, given an MD task Γ, to produce a solution M for Γ.

Theorem 2.16 (Complexity). The metadata-generation problem is NP-hard.

Furthermore, the problem of checking whether a scenario M is a solution for a task is also NP-hard.

Theorem 2.17. Let Γ be a task and M a scenario. The problem of checking whether M |= Γ is NP-hard.

It is possible to define tasks for which no solution exists. For instance, consider a task with no target sharing, a target schema size of 1, and two CP primitive instances. There is no integration scenario that is a solution for this task, because for a scenario to contain two instances of the CP primitive with no sharing of target relations, the scenario has to contain at least two target relations.

3. THE MDGEN ALGORITHM

We now present our MDGen algorithm that solves the metadata-generation problem for an input task Γ. Since the problem is NP-hard in general, a polynomial-time solution must either be an approximation or incomplete. As discussed in Section 2.7, the user may accidentally specify a task that has no solution. We could deal with this by aborting the generation and outputting the conflicting requirements (to help a user see what needs to be relaxed to avoid conflicts). Alternatively, we can relax the constraints (e.g., by increasing the relation size) and produce a solution for the relaxed constraints. We have chosen the second approach since we found it more useful when using the generator ourselves, as a relaxation that increases the size of one or more scenario parameters is usually sufficient. We present a greedy algorithm where we never reverse a decision once it has been made. Since our algorithm is best-effort (and checking if a solution exists is hard), we might relax the constraints even when an exact solution exists.


Algorithm 1 MDGen(Γ)
1: Initialize empty integration scenario M
   {Instantiate Primitives}
2: for each θ ∈ Θ do
3:   numInst = random(minθ, maxθ)
4:   for i ∈ {1, ..., numInst} do
5:     InstantiatePrimitive(θ, M, Γ)
   {Add Independent Mappings/Relations}
6: RandomMappings(M, ΓΠ)
   {Value Invention}
7: ComplexValueInvention(M, Γ)
   {Generate Additional Schema Constraints}
8: GenRandomConstraints(M, Γ)
   {Rename Schema Elements}
9: ApplyRenamingStrategy(M, Γ)
   {Generate Data}
10: for each S ∈ {S, T} do
11:   for each R ∈ S do
12:     genData(M, R, Γ)

3.1 Overview

Algorithm 1 (MDGen) takes as input an MD task Γ. We first instantiate primitives (native and UDPs). For each θ ∈ Θ, we uniformly at random create a number of instances (described in more detail in Section 3.2) that lies between minθ and maxθ (Lines 2 to 5 of Algorithm 1). During this process we keep track of the characteristics of the scenario being generated. We always obey the restrictions on the number of primitive instantiations and the primitive parameters, but allow violations of scenario parameters. Thus, we guarantee that a returned solution obeys ΓΘ and Γλ (which is always possible), but not that this solution fulfills ΓΠ. To create independent parts of the schemas and mappings, we use a primitive called random mappings (Line 6). This primitive takes as input the number of source relations, target relations, and mappings to create. These numbers are set to the numbers requested in ΓΠ minus the number of each that have already been created by primitives. If the primitives have already created larger schemas or more mappings than requested by ΓΠ, then additional metadata is not created (and the solution will violate the respective scenario parameters). We found this solution most satisfying for users. In general, if the relationship between the source/target schemas does not matter for an evaluation, then the user will just use scenario parameters (recall Example 2.3 where Alberto just needed a large set of random mappings over independent schemas) and the generator will always find a correct solution. However, if the relationship matters and primitives are used, we always satisfy the primitive constraints and relax, if necessary, the scenario restrictions.
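A minimal sketch of how the random-mappings step can derive its inputs from ΓΠ (the function and key names are illustrative, not iBench's):

```python
def independent_counts(requested, created):
    """Number of independent source relations, target relations, and mappings
    still to generate: the amount requested by the scenario parameters minus
    what primitive instantiation already produced, floored at zero."""
    return {key: max(0, requested.get(key, 0) - created.get(key, 0))
            for key in ("source_relations", "target_relations", "mappings")}

# Primitives already created 12 source relations but the task asked for 10:
print(independent_counts({"source_relations": 10, "target_relations": 20, "mappings": 15},
                         {"source_relations": 12, "target_relations": 8, "mappings": 9}))
# {'source_relations': 0, 'target_relations': 12, 'mappings': 6}
```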

Additional scenario parameters are handled within separate post-processing steps which are applied to the scenario (with mappings) that we have created. These include value invention, the generation of random constraints, and the renaming of target relations and attributes (if requested). For reasons of space, we only explain one of these post-processing steps, value invention, in Section 3.3.

We also generate data for the source schema (by calling ToXGene, a general-purpose data generator [6]). The user can control the number of tuples generated per relation (via the scenario parameter πRelInstSize).

Algorithm 2 InstantiatePrimitive(θ, M, Γ)
1: {Determine Sharing}
2: sourceShare = (random(0, 100) < πSourceShare)
3: targetShare = (random(0, 100) < πTargetShare)
   {Determine Restrictions on Scenario Elements}
4: Req = determinePrimReqs(θ, Γ)
   {Generate Source Relations}
5: for i ∈ {1, ..., Req(‖S‖)} do
6:   if sourceShare = true then
7:     tries = 0
8:     while R ⊭ Req ∧ tries++ < MAXTRIES do
9:       R = pickSourceRelation(M, Γ, i)
10:    end while
11:    if R ⊭ Req then
12:      R = genSourceRelation(M, Req, i)
13:  else
14:    R = genSourceRelation(M, Req, i)
    {Generate Target Relations}
15: Repeat Lines 5 to 14 using the target schema
    {Generate Mappings}
16: for i ∈ {1, ..., Req(‖ΣST‖)} do
17:   σ = genMapping(M, Req, i)
    {Generate Transformations and Constraints}
18: genFKs(M, Req)
19: genTransformations(M, Req)

In the future, we plan to give the user control over which value generators are used to create attribute values (the types of data generators and the associated probability of using them). However, this is not implemented in the current release.

3.2 Instantiate Primitives

As mentioned before, our algorithm is greedy in the sense that once we have created the integration scenario elements for a primitive instance, we never remove these elements. In this fashion, we iteratively accumulate the elements of a scenario by instantiating primitives. Algorithm 2 is used to instantiate one primitive of a particular type θ. When πSourceShare or πTargetShare are not set (their default is 0), each primitive instantiation creates new source and/or target relations. Otherwise, we achieve reuse among mappings by using relations created during a previous primitive instantiation instead of generating new relations. The choice of whether to reuse existing relations when instantiating a primitive is made probabilistically (Lines 2 and 3), where πSourceShare (respectively, πTargetShare) determines the probability of reuse. Once we have determined whether we would like to reuse previously generated schema elements or not, the next step is to determine the requirements Req on scenario elements based on the primitive type θ we want to instantiate, the scenario parameters ΓΠ, and the primitive parameters Γλ (Line 4). Recall that our algorithm makes a best-effort attempt to fulfill the input task restrictions. To avoid backtracking and to resolve conflicts between primitive requirements and the scenario restrictions, our algorithm violates scenario parameters if necessary. For instance, assume the user requests that source relations have 2 attributes (πSourceRelSize) and that VP primitives split a source relation into three fragments (λNumPartitions). It is impossible to instantiate a VP primitive that fulfills both conditions, because to create three fragments, the source relation has to contain at least three attributes. We define a precedence order of parameters and relax restrictions according to this order until a solution is found. In this example, we would choose to obey the restriction on λNumPartitions and violate the restriction on πSourceRelSize.
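A minimal sketch of this relax-by-precedence idea (illustrative names; not iBench's code): the parameter that comes later in the precedence list is relaxed until the primitive's requirements become satisfiable.

```python
def resolve_conflict(requirements, precedence):
    """requirements: parameter -> value needed by the primitive.
    precedence: parameters ordered from most to least important.
    Returns the values to use, relaxing the lower-precedence parameter if needed."""
    resolved = dict(requirements)
    # Conflict from the text: NumPartitions = 3 needs SourceRelSize >= 3.
    if resolved.get("NumPartitions", 1) > resolved.get("SourceRelSize", 0):
        loser = max(("NumPartitions", "SourceRelSize"), key=precedence.index)
        if loser == "SourceRelSize":
            resolved["SourceRelSize"] = resolved["NumPartitions"]  # relax upward
        else:
            resolved["NumPartitions"] = resolved["SourceRelSize"]
    return resolved

print(resolve_conflict({"NumPartitions": 3, "SourceRelSize": 2},
                       ["NumPartitions", "SourceRelSize"]))
# {'NumPartitions': 3, 'SourceRelSize': 3}
```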


Algorithm 3 ComplexValueInvention(M, Γ)
1: VI ← attribute → Skolem {Map from attributes to Skolems}
   {Associate Victim Attributes With Skolems}
2: for each i < (getNumAttrs(S) · πSkolemSharing) do
3:   S.A = pickRandomAttr(S) {Pick a random attribute}
4:   ~A = pickRandomAttr(S, πSkolemType)
5:   f = pickFreshSkolemName()
6:   Add(VI, S.A, f(~A))
   {Adapt Mappings}
7: for each σ ∈ Σst do
8:   for each (S.A → f(~A)) ∈ VI do
9:     if S ∈ σ then
10:      xA ← vars(σ, S.A)
11:      args ← vars(σ, ~A)
12:      term ← f(~A)[~A ← args]
13:      for each R(~x) ∈ RHS(σ) do
14:        R(~x) ← R(~x)[xA ← term]

In Lines 5 to 14, we generate source relations from scratch or reuse existing source relations. Since not every source relation can be reused (it may not fit the requirements of the primitive), we select relations and check whether they meet the requirements of the current primitive. These source relations are not picked completely at random; we only choose among relations that fulfill basic requirements of the primitive (e.g., a minimum number of attributes) to increase the chance that sharing is successful. After MAXTRIES tries, we fall back to generating a source relation from scratch. When generating a relation, we also generate primary key constraints if required by the primitive or requested by the user. The generation of target relations (Line 15) is analogous to the source relation generation. Then, we generate mappings between the generated source and target relations. The structure of the generated mappings is given by the primitive type (see Table 4) and further influenced by certain primitive and scenario parameters. For instance, λNumPartitions determines the number of fragments for a VP primitive and πSourceRelSize determines the number of attributes per relation. The generation of transformations is similar.

3.3 Value Invention

Part of the value-invention process has already been completed while instantiating primitives. Based on the parameter πSkolemType, we have parameterized value invention in mappings as described in Section 2.2. In addition, we use the scenario parameter πSkolemSharing to control how much additional sharing of Skolem functions we inject into the mappings. Algorithm 3 outlines the process. We randomly select attributes from the generated source schema (called victims) and associate them with fresh Skolem terms that depend on other attributes from the same source relation (Lines 2-6). For each selected attribute S.A, the attributes which are used as the arguments of the Skolem function are chosen based on the parameter πSkolemType; e.g., if πSkolemType = Keys then we use the key attributes of relation S. In Lines 7-14, we rewrite any mapping that uses a victim to now use the generated Skolem term. Here ψ[x ← y] denotes the formula derived from ψ by replacing all occurrences of x with y. For instance, assume the Skolem term f(a) was assigned to attribute b of relation S(a, b). We would transform the mapping S(x1, x2) → T(x1, x2) into S(x1, x2) → T(x1, f(x1)). Hence, if S is used in three mappings, then the Skolem function f will be shared by three mappings.
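A small sketch of the rewrite step of Lines 7-14, operating on mappings encoded as simple Python structures (an illustrative representation, not iBench's):

```python
def apply_value_invention(mapping, victims):
    """mapping: (source_atom, target_atoms) with atoms as (relation, [terms]).
    victims: maps (source_relation, attr_position) -> (skolem_name, [arg_positions]).
    Replaces the victim's variable in every target atom with the Skolem term."""
    (src_rel, src_vars), target_atoms = mapping
    rewritten = []
    for tgt_rel, tgt_terms in target_atoms:
        new_terms = list(tgt_terms)
        for (rel, pos), (f, arg_pos) in victims.items():
            if rel != src_rel:
                continue
            victim_var = src_vars[pos]
            term = f"{f}({', '.join(src_vars[p] for p in arg_pos)})"
            new_terms = [term if t == victim_var else t for t in new_terms]
        rewritten.append((tgt_rel, new_terms))
    return ((src_rel, src_vars), rewritten)

# S(x1, x2) -> T(x1, x2) with Skolem f(a) assigned to attribute b of S(a, b):
m = (("S", ["x1", "x2"]), [("T", ["x1", "x2"])])
print(apply_value_invention(m, {("S", 1): ("f", [0])}))
# (('S', ['x1', 'x2']), [('T', ['x1', 'f(x1)'])])
```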

3.4 Complexity and Correctness

Aside from solving the metadata-generation problem, the main goal in the design of MDGen was scalability. MDGen runs in PTime in the size of the generated integration scenario.

Theorem 3.1 (Complexity). The MDGen algorithm runs in time polynomial in n, where n is the maximum of the number of created relations and created mappings.

Since, by design, the algorithm may violate input scenario parameters in order to produce a solution without backtracking, a main question is what kind of correctness guarantees can be given for the solution returned by the algorithm under such relaxations of the inputs. If the algorithm does not have to relax any restriction, then the scenario produced by the algorithm is a solution to the input task.

Theorem 3.2 (Correctness). Let Γ be an MD task. MDGen fulfills the following correctness criteria:
• If MDGen returns a result without relaxing any restrictions given by the task, then the result is a solution for Γ.
• A solution returned by the algorithm always conforms to the primitive restrictions given in the input task.
• Let Γrelax be the MD task generated by replacing the restrictions in the task Γ with their relaxed versions produced by the algorithm when running over Γ. The scenario returned by the algorithm is a solution for Γrelax.

4. iBENCH EVALUATION

We now evaluate the scalability of iBench for various tasks, using native primitives and UDPs. We ran our experiments on an Intel Xeon X3470 with 8 × 2.93 GHz CPUs and 24 GB RAM, reporting averages over 100 runs.

4.1 Native Metadata Scenarios

We conducted four experiments to investigate the influence of the schema size on the time needed to generate large metadata scenarios. All four experiments share a baseline configuration that uses the same parameter settings for the number of attributes per relation (5-15), the same range for the size of keys (1-3), and the same size for mappings (2-4 source relations and 2-4 target relations). We generated scenarios of various sizes (100 up to 1M attributes, where we use the sum of source and target attributes as the measure of schema size) by using the baseline configuration and varying the number of primitives. In these experiments, iBench always produced solutions satisfying all configuration constraints.

Figure 5(a) shows the metadata generation time in seconds for generating four types of scenarios (on a log scale). The first configuration, denoted (0,0,0), has no constraints, no source sharing, and no target sharing. The other three configurations are created by introducing 25% FD constraints (25,0,0), 25% source sharing (0,25,0), and 25% target sharing (0,0,25), respectively. iBench can generate scenarios with 1M attributes in 6.3 sec for the (0,0,0) configuration and shows a linear trend, as expected. For the 25% constraint configuration, we also observe a linear trend: 13.89 sec for a 1M attribute scenario. For the source and target sharing configurations, we observed a non-linear trend. While iBench generates scenarios with 100K attributes in 2.1 and 2.5 sec, respectively, for 1M attributes iBench requires several minutes. Here, we noticed high variance in elapsed times: 215.91 sec with a standard deviation of 11.67 sec for source sharing, and 213.89 sec with a standard deviation of 14.14 sec for target sharing.


[Figure 5 plots omitted. Panels: (a) Gen. Time (s) - Native (generation time in seconds, logscale, vs. # generated attributes, for configurations (0,0,0), (25,0,0), (0,25,0), (0,0,25)); (b) Gen. Time (s) - UDPs; (c) Gen. Relations: UDPs; (d) Gen. Attributes: UDPs (each vs. instances of loaded primitive, for A1, A2, A3, FH, B1, B2, B3).]

Figure 5: iBench Scalability Results for Native and UDP Primitives

[Figure 6 plots omitted. Panels: (a) Ontology+STB: #Mappings (target size in # atoms, constants vs. nulls); (b) Rel. Improv.: 6(a) (Ont-Total, Ont-Nulls, STB-Total, STB-Nulls); (c) Gen. Time: 6(a); (d) Exec. Time: 6(a) (times in seconds, logscale); all vs. number of mapping primitives.]

Figure 6: Scalability of Clio (C) and MapMerge (M)

to the greedy, best-effort nature of the sharing algorithm. Despite this, we are still able to generate, in reasonable time, scenarios that are 10-100 times larger, and with much more realistic sharing among mappings, than used in any previous evaluation that we are aware of.

4.2 UDP-based Metadata Scenarios

To analyze the performance of iBench's UDP feature, we used seven real-life metadata scenarios from the literature and scaled them by factors of up to 1,000. The first three UDPs are based on the Amalgam Schema and Data Integration Test Suite [18], which describes metadata about bibliographical resources. We denote them by A1, A2, and A3. The next three UDPs are based on a biological metadata scenario called Bio [3], which employs fragments of the Genomics Unified Schema GUS (www.gusdb.org) and the generic Biological Schema BioSQL (www.biosql.org). We denote them by B1, B2, and B3. The last UDP (denoted as FH) is based on a relational encoding of a graph data exchange setting [8], comprising information about flights (with intermediate stops) and hotels. For each UDP, we vary the number of instances from 0 to 1,000. In all cases, we also requested 15 instances of each native primitive. We present the generation time in Figure 5(b), and the numbers of generated relations and attributes in Figures 5(c) and 5(d), respectively. As we scale the number of loaded UDPs, the metadata generation time grows linearly in most cases. We observe the worst behavior for A2 and A3, which contain the largest number of relations and up to 40 FKs (Figure 5(c)). While profiling our code, we realized that this non-linear behavior is due to a limitation in the Java-to-XML binding library we use for manipulating relations in the UDP; this will be optimized in future iBench releases.

5. INTEGRATION SYSTEM EVALUATION

To showcase how iBench can be used to empirically evaluate schema mapping creation, we present a novel evaluation comparing MapMerge [1] against Clio [10] and ++Spicy [15].

Systems. Clio creates mappings and transformations that produce a universal solution, MapMerge correlates Clio mappings to remove data redundancy, and ++Spicy creates mappings and transformations that attempt to produce core solutions [17]. We obtained Clio and MapMerge from the authors [1] and downloaded ++Spicy from the web (www.db.unibas.it/projects/spicy). Since these three systems create mappings, we used iBench to generate schemas, schema constraints, mappings, and instances of the source schemas. However, we passed to the three systems only the schemas, constraints, and correspondences. We omitted the iBench mappings, so each system created its own mappings and its own transformation code (which we ran over the iBench data to create target data that we could compare). We used an Intel Core i7 with 4×2.9 GHz CPU and 8 GB RAM. Notice that this is a different machine than the rack server used in Section 4, since all three systems we compare required input given through a user interface. As this approach is more labor-intensive, here we report averages over 5 runs.

Original MapMerge Evaluation. The original evaluation compared transformations that implement Clio's mappings (input to MapMerge) with transformations that implement correlated mappings produced by MapMerge. The evaluation used two real-life scenarios of up to 14 mappings, and a synthetic scenario (Example 1.2) with one source relation, up to 272 binary target relations, and 256 mappings. It was concluded that MapMerge improved the quality of mappings by both reducing the size of the generated target instance and increasing the similarity between the source and target instances. We implemented the synthetic scenario used in the original evaluation as an iBench primitive (VA), both to further our goal of having a comprehensive set of primitives and to perform a sanity (reproducibility) check (see the list of primitives in Appendix A.1). We obtained the same generated instance size and comparable times to the original evaluation (these measures are described below).



Our goal in the new experiments was to test the original experimental observations over more diverse and complex metadata scenarios. In particular, we use iBench to generate scenarios with a variable number of constraints and a variable degree of sharing to explore how these important factors influence mapping correlation.

Original ++Spicy Evaluation. ++Spicy evaluated the quality of its solutions using four fixed scenarios from the literature with 10 (or fewer) tgds and 12 (or fewer) egds. They compared the size of core solutions that were computed w.r.t. tgds only with the size of core solutions that used both tgds and egds. They also used synthetic scenarios (created using STBenchmark and similar to our STB scenarios described below) to evaluate the scalability of their algorithms. While their scalability experiments control the number of constraints, their quality evaluation did not. Our goal was to quantify the quality gains of ++Spicy as the number of constraints is varied and as the degree of sharing is varied (hence on more complex scenarios). Also, we compare ++Spicy directly to MapMerge on the same scenarios, a comparison that has not been done before.

Metadata Scenarios. We designed two scenarios using iBench. The Ontology scenarios consist of primitives that can be used to map a relational schema to an ontology. Three of these primitives are different types of vertical partitioning, into a HAS-A, IS-A, or N-to-M relationship (VH, VI, VNM). The fourth primitive is ADD (copies and adds new attributes). The STB scenarios consist of primitives supported by STBenchmark [2]: CP (copy a relation), VP (vertical partitioning), HP (horizontal partitioning), and SU (copy a relation and create a surrogate key). The Ontology scenarios produce at least twice as many value inventions (referred to hereafter as nulls) as the STB scenarios (e.g., one instance of each ontology primitive (ADD, VH, VI, and VNM) yields 8 nulls, while one instance of each STB primitive (CP, VP, HP, and SU) only yields 4 nulls). The Ontology scenarios also include more keys compared to STB.

Measures. Intuitively, mappings that produce smaller target instances are more desirable because they produce less incompleteness. To assess the quality of the correlated mappings, we measure the size of the generated target instance and the relative improvement (of MapMerge or ++Spicy w.r.t. Clio), where the size of an instance is the number of atomic values that it contains [1]. Assuming that the Clio mappings produce an instance of size x and the MapMerge mappings produce an instance of size y, the relative improvement is (x − y)/x. The original MapMerge evaluation [1] used this measure and also measured the similarity between source and target instances using the notion of full disjunction [22]. We were not able to use full disjunction in our experiments because we use arbitrary constraints and sharing, and these features often break the gamma-acyclicity precondition required for the implementation of this measure. Instead, we measure the number of nulls in the generated target instance. ++Spicy used a similar measure, however, they measure size in tuples rather than atomic values. We also report the performance in terms of time for generating mappings and time for executing them.
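As a minimal sketch (an addition for clarity, not code from the paper), the two measures can be computed as follows; instance_size and relative_improvement are illustrative helper names and the instance representation is our own.

def instance_size(instance):
    """Number of atomic values in an instance given as {relation: list of tuples}."""
    return sum(len(t) for tuples in instance.values() for t in tuples)

def relative_improvement(clio_instance, other_instance):
    """Relative improvement of MapMerge or ++Spicy w.r.t. Clio: (x - y) / x."""
    x = instance_size(clio_instance)
    y = instance_size(other_instance)
    return (x - y) / x

# For example, if the Clio mappings produce 120,000 atoms and the MapMerge
# mappings produce 90,000 atoms, the relative improvement is 0.25.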

5.1 Scalability of Clio and MapMerge

For the two aforementioned scenarios, we generate an equal number of instances for each primitive, for varying powers of two. On the x-axis of Figures 6(a) to 6(d) we report the total number of primitives, e.g., the value 128 for Ontology scenarios corresponds to the primitives ADD, VH, VI, VNM each occurring 128/4 = 32 times. We generated random INDs for 20% of the relations. We used 25% source and target sharing for Ontology scenarios. For STB scenarios, we also used 25% target sharing, but 50% source sharing to compensate for sharing that naturally occurs (e.g., primitives such as HP generate multiple mappings sharing relations, hence sharing occurs even if 0% sharing is requested). We set the source relation size parameter to (2-6) and create a source instance with 1,000 tuples per relation.

The target instance sizes are shown in Figure 6(a); as expected, they are linear in the number of primitives. Clio (C) produces the same number of constants and more nulls than MapMerge (M). The Ontology scenarios produce more nulls and larger instances than the STB scenarios. Figure 6(b) shows that, in general, the relative improvement of MapMerge over Clio is higher for Ontology compared to STB. An important outcome of our experiments is that the benefits of MapMerge w.r.t. Clio mappings are more visible in scenarios involving more incompleteness. We notice that the relative improvement remains more or less constant (the exception is on small STB scenarios, where Clio and MapMerge are almost indistinguishable), because the characteristics of the scenarios are the same for different sizes.

We present the mapping generation and execution time (logscale) in Figures 6(c) and 6(d), respectively. Generating and executing mappings for the Ontology scenarios takes longer than for the STB scenarios due to the amount of nulls. Although MapMerge requires more time than Clio to generate mappings, it requires less time to execute them. This behavior is more visible for larger numbers of mappings because, in our experiments, this implies larger target instances (up to 2M atoms, in contrast to the original MapMerge evaluation that only considered up to 120K atoms).

Our findings extend the original MapMerge evaluation in two ways: (i) the original evaluation does not report mapping execution time (only generation time), and (ii) MapMerge has greater benefit on the Ontology scenarios, which are defined using iBench primitives that are not covered by previous scenario generators nor by the existing MapMerge evaluation. Indeed, the very flexible value invention provided by iBench reveals another strong point of MapMerge not considered before: MapMerge not only generates smaller instances compared to Clio, but is also more time efficient.

5.2 Impact of Constraints and Sharing

We now study the impact of random constraints and sharing for relations with (3-7) attributes and 1,000 tuples. We ran three experiments, and for each scenario, we use 10 instantiations of each of its primitives. First, we vary the number of source and target INDs from 10 to 40%, with no source or target sharing. We present the size of the generated target instance for both scenarios in Figure 7(a) and the relative improvement in Figure 7(b). Here the relative improvement is not constant, but improves as the number of constraints increases. Second, we vary target sharing from 0-60% with no random constraints. Source sharing is fixed to 20% for Ontology and to 50% for STB. We present the size of the generated target instance for both scenarios in Figure 7(c) and the relative improvement in Figure 7(d). Third, we vary the amount of source sharing from 0-60%. Target sharing is fixed to 20% and there are no random constraints.



[Figure 7 plots omitted. Panels: (a) % Constraints, (c) Target Sharing, (e) Source Sharing (target size in # atoms, constants vs. nulls, Ontology and STB); (b), (d), (f) corresponding relative improvements (Ont-Total, Ont-Nulls, STB-Total, STB-Nulls).]

Figure 7: Evaluation of Clio (C) and MapMerge (M)

Target instance sizes are shown in Figure 7(e) and the relative improvement in Figure 7(f).

Our experiments reveal that both sharing and constraints increase the relative improvement of MapMerge over Clio. We observe that for STB, the biggest gain comes from using constraints, while for the Ontology scenarios the biggest gain comes from target sharing. This suggests that the benefits shown in Figure 6(b) (for scenarios combining constraints and sharing) are due to both factors. MapMerge is most useful for mappings that share relations or contain INDs. Our results highlight the impact of MapMerge in novel scenarios, going beyond the original evaluation.

5.3 Comparison with ++Spicy

Despite producing target instances smaller than Clio's, MapMerge may not produce core solutions. Here, we include a comparison with ++Spicy [16]. More precisely, we present two experiments: (i) we first compare the core solutions produced by ++Spicy with the MapMerge and Clio solutions [17], and (ii) we independently study the impact of using egds in the mapping, an important ++Spicy feature [15]. For (i), we took 10 instances of each Ontology primitive, and source relations with (3-7) attributes and 100 tuples. We focus on the Ontology scenarios, which have both more keys and more nulls, making the comparison with MapMerge clearer. We used ++Spicy in its default configuration, i.e., with core rewriting but without egd rewriting.

[Figure 8 plots omitted. Panels: (a) % Constraints, (c) Target Sharing, (e) Source Sharing (target size in # atoms, constants vs. nulls, for Clio (C), MapMerge (M), ++Spicy (S)); (b), (d), (f) corresponding relative improvements w.r.t. Clio (M-Total, M-Nulls, S-Total, S-Nulls).]

Figure 8: Evaluation of ++Spicy (S)

In Figure 8, we report the generated target instance size and the relative improvement w.r.t. Clio for both MapMerge and ++Spicy, for three different settings: we vary the amount of source and target INDs (8(a) and 8(b)), target sharing (8(c) and 8(d)), and source sharing (8(e) and 8(f)), respectively (variations over the same ranges while keeping the other parameters fixed to the same values as in Section 5.2). The size of the instance generated by ++Spicy remains constant regardless of the random constraints (cf. 8(a)). This is in contrast to both Clio and MapMerge, where the size increases. This indicates that the core solutions of ++Spicy are effectively removing redundancy created by chasing over these additional INDs. The relative improvement is always greater for ++Spicy compared to MapMerge (cf. 8(b), 8(d), and 8(f)). These benefits come naturally with a time cost: while generating Clio or MapMerge mappings took up to 8 sec, generating the ++Spicy mappings took up to 105 sec.

5.4 Evaluation of ++Spicy Using UDPs

In our second ++Spicy experiment, we quantified the benefits of using egds in mapping creation. Neither Clio nor MapMerge uses egds, hence we compared the solution given by ++Spicy in the presence of egds with its canonical and core solutions. The advantages of using egds can be seen in many integration scenarios generated by iBench, but to best highlight the gains, we report on one experiment in which we created a specially designed UDP. The UDP contains source address information shredded across three relations that needs to be combined into a single target relation.



[Figure 9 plots omitted. Panels: (a) Target Instance Size (# atoms, constants vs. nulls, vs. % target relations with keys, for Ca, Co, E); (b) Rel. Improv.: 9(a) (Co/Ca, E/Ca, E/Co, total and nulls).]

Figure 9: ++Spicy: Canonical Solution (Ca), Core Solution (Co), Solution in the Presence of EGDs (E)

The UDP is easy to create (the UDP code took only a few minutes to write and is 79 lines of XML, shown in Figure 11). The source has three relations, Person(name, address), Address(occupant, city), and Place(occupant, zip), and two FKs (from the two occupant attributes to Person.name). In the target, we have one relation LivesAt(name, city, zip). There is a correspondence towards each target attribute from the source attribute having the same name. In the experiment, we invoked the UDP 40 times with source relations containing 100 tuples, and we varied the number of target relations having a primary key from 0 to 100%. We report the generated target size and the relative improvement in Figures 9(a) and 9(b), respectively. The canonical and core solutions have constant sizes. As the number of keys increases, ++Spicy uses them on more and more relations to remove incomplete redundant tuples, as expected. Notice that iBench makes the generation of such an experiment very easy. Once the UDP is created, iBench can apply it on data instances of different sizes and characteristics (for ++Spicy this would include reuse of values) and can invoke it an arbitrary number of times with other primitives or with additional (arbitrary) mappings. Despite the fact that the mapping generation problem is hard in the presence of UDPs, iBench makes it easy to create and apply UDPs flexibly and to quickly develop robust evaluations.
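For illustration only (these dependencies are our own rendering, not necessarily the mappings the systems derive), the UDP essentially expresses tgds of the form Person(n, a) ∧ Address(n, c) → ∃Z LivesAt(n, c, Z) and Person(n, a) ∧ Place(n, z) → ∃C LivesAt(n, C, z); if we additionally assume a key on LivesAt.name, ++Spicy obtains the egd LivesAt(n, c, z) ∧ LivesAt(n, c′, z′) → c = c′ ∧ z = z′, with which the incomplete partial tuples for the same person can be merged into a single complete tuple.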

6. CONCLUSION

Developing a system like iBench is an ambitious goal. We presented the first version of iBench. We used the system to conduct a new comparative evaluation of MapMerge and ++Spicy, systems that had been evaluated individually but not compared directly. We also presented a new evaluation of ++Spicy to quantify the influence of target key constraints on the quality of its solutions, an evaluation not done by the authors for lack of appropriate metadata generation tools. Moreover, iBench was an essential tool in a large-scale empirical evaluation we have conducted in previous work [5]. Our hope is that iBench will be a catalyst for encouraging more empirical work in data integration and a tool for researchers to use in developing, testing, and sharing new quality measures. In the future, we will extend the prototype with new functionality (with the community's help).

Acknowledgements. We thank Alexe and Popa for sharing MapMerge, Mecca and Papotti for helping us use ++Spicy, and all of them for verifying our experimental conclusions.

7. REFERENCES

[1] B. Alexe, M. A. Hernandez, L. Popa, and W.-C. Tan. MapMerge: Correlating Independent Schema Mappings. VLDB J., 21(2):191–211, 2012.
[2] B. Alexe, W.-C. Tan, and Y. Velegrakis. STBenchmark: Towards a Benchmark for Mapping Systems. PVLDB, 1(1):230–244, 2008.
[3] B. Alexe, B. ten Cate, P. G. Kolaitis, and W.-C. Tan. Designing and Refining Schema Mappings via Data Examples. In SIGMOD, pages 133–144, 2011.
[4] M. Arenas, J. Perez, J. L. Reutter, and C. Riveros. The Language of Plain SO-tgds: Composition, Inversion and Structural Properties. J. Comput. Syst. Sci., 79(6):763–784, 2013.
[5] P. C. Arocena, B. Glavic, and R. J. Miller. Value Invention in Data Exchange. In SIGMOD, pages 157–168, 2013.
[6] D. Barbosa, A. Mendelzon, J. Keenleyside, and K. Lyons. ToXgene: An Extensible Template-based Data Generator for XML. In WebDB, pages 49–54, 2002.
[7] P. A. Bernstein, T. J. Green, S. Melnik, and A. Nash. Implementing Mapping Composition. VLDB J., 17(2):333–353, 2008.
[8] I. Boneva, A. Bonifati, and R. Ciucanu. Graph Data Exchange with Target Constraints. In EDBT/ICDT Workshops–GraphQ, pages 171–176, 2015.
[9] R. Fagin. Inverting Schema Mappings. TODS, 32(4):24, 2007.
[10] R. Fagin, L. M. Haas, M. A. Hernandez, R. J. Miller, L. Popa, and Y. Velegrakis. Clio: Schema Mapping Creation and Data Exchange. In Conceptual Modeling: Foundations and Applications, pages 198–236, 2009.
[11] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data Exchange: Semantics and Query Answering. Theor. Comput. Sci., 336(1):89–124, 2005.
[12] R. Fagin, P. G. Kolaitis, L. Popa, and W.-C. Tan. Composing Schema Mappings: Second-Order Dependencies to the Rescue. TODS, 30(4):994–1055, 2005.
[13] J. Hammer, M. Stonebraker, and O. Topsakal. THALIA: Test Harness for the Assessment of Legacy Information Integration Approaches. In ICDE, pages 485–486, 2005.
[14] M. Lenzerini. Data Integration: A Theoretical Perspective. In PODS, pages 233–246, 2002.
[15] B. Marnette, G. Mecca, and P. Papotti. Scalable Data Exchange with Functional Dependencies. PVLDB, 3(1):105–116, 2010.
[16] B. Marnette, G. Mecca, P. Papotti, S. Raunich, and D. Santoro. ++Spicy: An Open-Source Tool for Second-Generation Schema Mapping and Data Exchange. PVLDB, 4(12):1438–1441, 2011.
[17] G. Mecca, P. Papotti, and S. Raunich. Core Schema Mappings. In SIGMOD, pages 655–668, 2009.
[18] R. J. Miller, D. Fisla, M. Huang, D. Kymlicka, F. Ku, and V. Lee. The Amalgam Schema and Data Integration Test Suite. www.cs.toronto.edu/~miller/amalgam, 2001.
[19] D. Patterson. Technical Perspective: For Better or Worse, Benchmarks Shape a Field. CACM, 55(7), 2012.



[20] L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernandez, and R. Fagin. Translating Web Data. In VLDB, pages 598–609, 2002.
[21] R. Pottinger and A. Halevy. MiniCon: A Scalable Algorithm for Answering Queries Using Views. VLDB J., 10(2-3):182–198, 2001.
[22] A. Rajaraman and J. D. Ullman. Integrating Information by Outerjoins and Full Disjunctions. In PODS, pages 238–248, 1996.
[23] B. ten Cate and P. G. Kolaitis. Structural Characterizations of Schema-Mapping Languages. Commun. ACM, 53(1):101–110, 2010.
[24] C. Yu and L. Popa. Semantic Adaptation of Schema Mappings when Schemas Evolve. In VLDB, pages 1006–1017, 2005.

APPENDIX

A. PRIMITIVE TYPES AND PARAMETERS

A.1 Supported Primitive Types

A list of all primitives supported by iBench is shown in Figure 10.

A.2 Scenario Parameters and Primitive Parameters

The scenario parameters and primitive parameters supported in the current version of iBench are shown in Tables 4 and 5, respectively.

B. PROOFS

Theorem 2.16 The metadata-generation problem is NP-hard.

Proof. We prove that the metadata-generation problem is NP-hard in the number of primitive types. To see why this is the case, consider a reduction from the decision version of the unbounded knapsack problem: given a set of items I = {i1, ..., in} with associated weights wi and values vi, does there exist a multiset of items from I, represented as a vector of multiplicities xi, such that the total weight of the items is below or equal to a threshold W (∑i xi · wi ≤ W) and the total value of these items is above a threshold V (∑i xi · vi ≥ V)? We encode an input knapsack problem as a metadata-generation problem as follows. Each item ii is encoded as a UDP that maps wi source relations to vi target relations. We set the number of attributes per source/target relation to one (πSourceRelSize = πTargetRelSize = 1) and disallow any source or target sharing (πSourceShare = πTargetShare = 0). The weight threshold W is encoded by setting πSourceSchemaSize = [0,W]. The value threshold is encoded by setting πTargetSchemaSize = [V,∞]. Finally, we disallow random mappings (πRandomMappings = false). A solution to this metadata-generation problem exists if and only if the corresponding knapsack problem has a solution with a total value above or equal to V. To see why this is the case, consider a solution to the metadata-generation problem. We determine xi as the number of instances of the UDP representing the i-th item.
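As a concrete illustration of the reduction (our addition, not part of the original proof): consider items i1 (w1 = 2, v1 = 3) and i2 (w2 = 3, v2 = 1) with W = 5 and V = 4. The encoding uses a UDP1 mapping 2 source relations to 3 target relations, a UDP2 mapping 3 source relations to 1 target relation, πSourceSchemaSize = [0, 5], and πTargetSchemaSize = [4,∞]. A generated scenario with one instance of each UDP uses 2 + 3 = 5 ≤ W source relations and produces 3 + 1 = 4 ≥ V target relations, so the multiplicities (x1, x2) = (1, 1) solve the knapsack instance.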

Theorem B.1. There exist MD tasks for which no solution exists.

Proof. The following task has no solution: no target sharing (πTargetSharing = 0%), two CP primitive instances (θCP = 2), and a target schema size of 1 (πTargetSchemaSize = 1). There is no integration scenario that is a solution for this task, because for a scenario to contain two instances of the CP primitive with no sharing of target relations, the scenario has to contain at least two target relations.

Theorem 2.17 Let Γ be a task and M a scenario. The problem of checking whether M |= Γ is NP-hard.

Proof. The theorem can be proven through a reduction from the subgraph isomorphism problem: given two graphs G = (V,E) and H = (V′,E′), does there exist a subgraph of G that is isomorphic to H? We construct a task Γ and a scenario M such that M is a solution for Γ (M |= Γ) iff there exists a subgraph of G isomorphic to H. The graph G is encoded in M as follows. We create a source relation Rv with a single attribute for every vertex v in V and a target relation Re for every edge e. For every edge e = (v, v′) we create two mappings ∀a : Rv(a) → Re(a) and ∀a : Rv′(a) → Re(a). We create a UDP by encoding H in the same fashion. The task Γ then requests at least one instance of this UDP. If M |= Γ, then there exists at least one set of mappings and relations from M that has the same structure as the UDP representing H, which is only possible if H is isomorphic to the subgraph of G represented by these mappings and relations.
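As a small example of this encoding (our addition): for a triangle graph G with vertices v1, v2, v3 and edges e12 = (v1, v2), e13 = (v1, v3), e23 = (v2, v3), the scenario M contains source relations Rv1(a), Rv2(a), Rv3(a), target relations Re12, Re13, Re23, and, for edge e12, the two mappings ∀a : Rv1(a) → Re12(a) and ∀a : Rv2(a) → Re12(a) (and analogously for e13 and e23); a pattern graph H is encoded as a UDP in the same fashion.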

Theorem 3.1 The MDGen algorithm runs in time polynomial in n, where n is the maximum of the number of relations and the number of mappings to be created.

Proof. We first discuss the complexity of instantiating a single primitive (Algorithm 2). Determining the requirements for a primitive can be done in constant time. Generating or choosing source and target relations is linear in the number of such relations created (since the number of tries for picking relations to share is limited by a constant and because we index relations in memory based on characteristics which are evaluated during picking). The same applies to adapting the templates for mappings, transformations, and FKs. In summary, instantiating primitives produces n source or target relations in time O(n). It remains to be shown that the post-processing steps run in polynomial time. For instance, for complex value invention, picking random attributes is linear in the number of relations (because we can assume that the number of attributes per relation is bounded by a constant). Replacing each "victim" with a Skolem term can be done in linear time in the number of mappings (or the number of occurrences of this attribute, if a hash map between attributes and mappings is maintained). Similar arguments hold for the other post-processing steps.

Since by design the algorithm may violate scenario restrictions from an input task in order to produce a solution without backtracking, a main question is what kind of correctness guarantees can be given for the solution returned by the algorithm under such relaxations of the input restrictions. First, if the algorithm does not have to relax any restriction, then the scenario produced by the algorithm is a solution to the input task.

Theorem 3.2 Let Γ be an MD task. The MDGen algorithm fulfills the following correctness criteria:



Name, description, example mapping(s), and Skolem behavior for each primitive:

ADD: Copy a relation and add new attributes. Example: R(a, b) → T(a, b, f(a, b)). Skolems: Variable.
ADL: Copy a relation, add and delete attributes in tandem. Example: R(a, b) → T(a, f(a)). Skolems: Variable.
AS: Add new source relation. Example: R(a, ...). Skolems: none.
AT: Add new target relation. Example: T(a, ...). Skolems: none.
CP: Copy a relation. Example: R(a, b) → T(a, b). Skolems: none.
DL: Copy a relation and delete attributes. Example: R(a, b) → T(a). Skolems: none.
HP: Horizontally partition a relation into multiple relations. Example: R(a, b) ∧ a = c1 → T1(b); R(a, b) ∧ a = c2 → T2(b). Skolems: none.
MA: Inverse of vertical partitioning + adding attributes. Example: R(a, b) ∧ S(b, c) → T(a, b, c, f(a, b, c)). Skolems: Variable. Source FK from S to R (e.g., S.b to R.b).
ME: Inverse of vertical partitioning (merge). Example: R(a, b) ∧ S(b, c) → T(a, b, c). Skolems: none. Source FK from S to R (e.g., S.b to R.b).
OF: Object fusion, e.g., inverse of horizontal partitioning. Example: R(a, b) → T(a, b, f(a)); S(a, c) → T(a, g(a), c); R(a, b) ∧ S(a, c) → T(a, b, c). Skolems: Fixed.
SJ: Copy relation (S) and create a relationship table (T) through a self-join. Example: R(a, b, c) → S(a, c); R(a, b, c) ∧ R(b, d, e) → T(a, b). Skolems: none. Source FK from R.b to R.a.
SU: Copy a relation and create a surrogate key. Example: R(a, b) → T(f(a, b), b, g(b)). Skolems: Fixed/Variable.
VA: Vertical partitioning into N authority relationships (example mappings shown for N = 2). Example: R(a, b, c, d, e, f) → A1(a, g(a)) ∧ A11(b, g(a)); R(a, b, c, d, e, f) → A1(a, g(a)) ∧ A12(c, g(a)); R(a, b, c, d, e, f) → A2(d, h(d)) ∧ A21(e, h(d)); R(a, b, c, d, e, f) → A2(d, h(d)) ∧ A22(f, h(d)). Skolems: Fixed. Target PK for each Top Authority Relation (e.g., A1 and A2), and also a target FK from each Slave Relation (e.g., A11 and A12) to its Top Authority (e.g., A1).
VH: Vertical partitioning into a HAS-A relationship. Example: R(a, b) → S(f(a), a) ∧ T(g(a, b), b, f(a)). Skolems: Fixed. Target FK from T to S (e.g., T.f(a) to S.f(a)).
VI: Vertical partitioning into an IS-A relationship. Example: R(a, b, c) → S(a, b) ∧ T(a, c). Skolems: none. Target FKs from S to T, and also from T to S (e.g., T.a to S.a and vice versa).
VNM: Vertical partitioning into an N-to-M relationship. Example: R(a, b) → S1(f(a), a) ∧ M(f(a), g(b)) ∧ S2(g(b), b). Skolems: Fixed. Target FKs from M to S1 and S2 (e.g., M.f(a) to S1.f(a) and M.g(b) to S2.g(b)).
VP: Vertical partitioning. Example: R(a, b) → S1(f(a, b), a) ∧ S2(f(a, b), b). Skolems: Variable. Target FKs from S1 to S2, and also from S2 to S1 (e.g., S1.f(a, b) to S2.f(a, b) and vice versa).

Figure 10: List of All Primitives Currently Supported by iBench

• If the MDGen algorithm returns a result without relaxing any restrictions given by the task, then the result is a solution for Γ.
• A solution returned by the algorithm always conforms to the primitive restrictions given in the input task.
• Let Γrelax be the MD task generated by replacing the restrictions in the task Γ with their relaxed versions produced by the algorithm when running over Γ. The scenario returned by the algorithm is a solution for Γrelax.

C. TRANSFORMATIONS AND OTHER ADDITIONAL TYPES OF METADATA

D. EVALUATION

D.1 Metadata Scenarios

We provide some additional detail on the two scenarios used in Section 5. We designed two scenarios using iBench. The Ontology scenarios consist of primitives that can be used to map a relational schema to an ontology. Three of these primitives are different types of vertical partitioning, into a HAS-A, IS-A, or N-to-M relationship (VH, VI, VNM). The fourth primitive is ADD (copies and adds new attributes). ADD creates one source relation (one key) and one target relation (one key). VH creates one source relation and two target relations (two keys and one FK). VI creates one source relation (one key) and two target relations (one key and one FK). VNM creates one source relation (one key) and three target relations (two keys and two FKs). The STB scenarios consist of primitives supported by STBenchmark [2]: CP (copy a relation), VP (vertical partitioning), HP (horizontal partitioning), and SU (copy a relation and create a surrogate key). CP creates one source relation (one key) and one target relation (one key). VP creates one source relation (one key) and two target relations (two keys and one FK). HP creates one source relation and two target relations. SU creates one source relation and one target relation (one key). The Ontology scenarios produce at least twice as many value inventions (referred to hereafter as nulls) as the STB scenarios (e.g., one instance of each ontology primitive (ADD, VH, VI, and VNM) yields 8 nulls, while one instance of each STB primitive (CP, VP, HP, and SU) only yields 4 nulls). The Ontology scenarios also include more source and target keys compared to STB.

D.2 ++Spicy UDP

The code for the ++Spicy UDP (Section 5.4) is given in Figure 11.



<?xml version="1.0" encoding="UTF-8"?>
<this:MappingScenario xmlns:this="org/vagabond/xmlmodel"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Schemas>
    <SourceSchema>
      <Relation name="person">
        <Attr><Name>name</Name><DataType>TEXT</DataType></Attr>
        <Attr><Name>address</Name><DataType>TEXT</DataType></Attr>
        <PrimaryKey><Attr>name</Attr></PrimaryKey>
      </Relation>
      <Relation name="address">
        <Attr><Name>zip</Name><DataType>TEXT</DataType></Attr>
        <Attr><Name>city</Name><DataType>TEXT</DataType></Attr>
        <PrimaryKey><Attr>zip</Attr></PrimaryKey>
      </Relation>
      <ForeignKey id="person_add_fk">
        <From tableref="person"><Attr>address</Attr></From>
        <To tableref="address"><Attr>zip</Attr></To>
      </ForeignKey>
    </SourceSchema>
    <TargetSchema>
      <Relation name="livesAt">
        <Attr><Name>name</Name><DataType>TEXT</DataType></Attr>
        <Attr><Name>city</Name><DataType>TEXT</DataType></Attr>
        <Attr><Name>zip</Name><DataType>TEXT</DataType></Attr>
        <PrimaryKey><Attr>zip</Attr></PrimaryKey>
      </Relation>
    </TargetSchema>
  </Schemas>
  <Correspondences>
    <Correspondence id="c1">
      <From tableref="person"><Attr>name</Attr></From>
      <To tableref="livesAt"><Attr>name</Attr></To>
    </Correspondence>
    <Correspondence id="c2">
      <From tableref="address"><Attr>zip</Attr></From>
      <To tableref="livesAt"><Attr>zip</Attr></To>
    </Correspondence>
    <Correspondence id="c3">
      <From tableref="address"><Attr>city</Attr></From>
      <To tableref="livesAt"><Attr>city</Attr></To>
    </Correspondence>
  </Correspondences>
  <Mappings>
    <Mapping id="M1">
      <Uses>
        <Correspondence ref="c1" />
        <Correspondence ref="c2" />
        <Correspondence ref="c3" />
      </Uses>
      <Foreach>
        <Atom tableref="person"><Var>a</Var><Var>b</Var></Atom>
        <Atom tableref="address"><Var>b</Var><Var>c</Var></Atom>
      </Foreach>
      <Exists>
        <Atom tableref="livesAt">
          <Var>a</Var><Var>b</Var><Var>c</Var>
        </Atom>
      </Exists>
    </Mapping>
    <Mapping id="M2">
      <Uses>
        <Correspondence ref="c2" />
        <Correspondence ref="c3" />
      </Uses>
      <Foreach>
        <Atom tableref="address"><Var>a</Var><Var>b</Var></Atom>
      </Foreach>
      <Exists>
        <Atom tableref="livesAt">
          <SKFunction skname="SK1">
            <Var>b</Var><Var>c</Var>
          </SKFunction>
          <Var>a</Var><Var>b</Var>
        </Atom>
      </Exists>
    </Mapping>
  </Mappings>
</this:MappingScenario>

Figure 11: XML code for ++Spicy UDP



Parameter, description, and default value for each scenario parameter:

Schema Size
πSourceRelSize: Number of attributes per source relation. Default: [3,3]
πTargetRelSize: Number of attributes per target relation. Default: [3,3]
πSourceSchemaSize: Number of relations in source schema. Default: [10,10]
πTargetSchemaSize: Number of relations in target schema. Default: [10,10]

Mapping Characteristics
πNumMaps: Number of mappings. Default: [0,∞]
πSourceMapSize: Number of source atoms per independent mapping. Default: [1,3]
πTargetMapSize: Number of target atoms per independent mapping. Default: [1,1]
πMapJoinSize: Number of attributes shared by two atoms in join expressions. Default: [1,1]

Constraints
πSourceFDs: Percentage of FDs per source relation. Default: 0
πPrimaryKeyFDs: Generation of PK FDs, on or off. Default: Off
πPrimaryKeySize: Size of PKs per source relation. Default: 0
πSourceINDs: Percentage of INDs per source relation. Default: 0
πSourceCircularINDs: Percentage of circular INDs per source relation. Default: 0
πSourceFKs: Percentage of INDs that are FKs per source relation. Default: 0
πSourceCircularFKs: Percentage of circular INDs that are FKs per source relation. Default: 0
πTargetINDs: Percentage of INDs per target relation. Default: 0
πTargetCircularINDs: Percentage of circular INDs per target relation. Default: 0
πTargetFKs: Percentage of INDs that are FKs per target relation. Default: 0
πTargetCircularFKs: Percentage of circular INDs that are FKs per target relation. Default: 0

Mapping Incompleteness
πNumSkolem: Percentage of target variables not used in source. Default: 0
πSkolemType: Type of value invention (All, Key, Random). Default: All
πSkolemSharing: Percentage of Skolem facts shared among mappings. Default: 0

Sharing
πSourceSharing: Percentage of mappings that share source relations. Default: [0,0]
πTargetSharing: Percentage of mappings that share target relations. Default: [0,0]

Instance Data
πRelInstSize: Number of tuples per relation instance. Default: [5,5]
πIndSourceInst: Percentage of source instance data that is independent of the target, i.e., will be ignored by transformations implementing the mapping. Default: [0,0]
πIndTargetInst: Percentage of target instance data that is created independently of the source (not the result of exchanging source data). Default: [0,0]

Metadata Renaming
πRename: Percentage of target metadata that is renamed. Default: (0,0)

Table 4: Scenario Parameters and Defaults

Parameter, description, and default value for each primitive parameter:

λNumAttrAdd: Number of attributes added by primitives that create new attributes in the target (ADD, ADL, MA). Default: [1,1]
λNumAttrDel: Number of attributes deleted by primitives that remove attributes in the target (DL). Default: [1,1]
λNumPartitions: Number of partitions for vertical and horizontal partitioning, and merging primitives. Default: [2,2]
λJoinKind: Type of joins (0=Star, 1=Chain) used to construct source/target expressions. Default: 1
λVAComplexity: Number of authority hierarchies in the target schema (VA). Default: 0

Table 5: Primitive Parameters and Defaults
