
A framework to test schema matching algorithms

by

Bhavik Doshi

A Project Report Submitted in

Partial Fulfillment of the Requirements for the Degree of

Master of Science in

Computer Science

Supervised by

Dr. Rajendra K. Raj

Department of Computer Science

B. Thomas Golisano College of Computing and Information Sciences
Rochester Institute of Technology

Rochester, New York

September 2010


Project Report Release Permission Form

Rochester Institute of Technology
B. Thomas Golisano College of Computing and Information Sciences

Title: A framework to test schema matching algorithms

I, Bhavik Doshi, hereby grant permission to the Wallace Memorial Library to reproduce my project in whole or part.

Bhavik Doshi

Date


The project “A framework to test schema matching algorithms” by Bhavik Doshi has been examined and approved by the following Examination Committee:

Dr. Rajendra K. Raj
Professor
Project Committee Chair

Dr. Carol Romanowski
Assistant Professor

Dr. Trudy M. Howles
Associate Professor


Acknowledgments

This project would not have been possible without the support of many people. I am deeply grateful to my project committee for their constant direction, support and guidance.

Dr. Rajendra K. Raj, as my committee chair, was instrumental in guiding me in the right direction and always provided valuable guidance on my project. Dr. Raj also tracked my progress and helped steer the project the right way.

Dr. Carol Romanowski was always ready to help me by providing valuable suggestions to improve my ideas for implementing the proposed project.

I especially would like to thank Dr. Trudy Howles for giving my project another perspective and for helping me improve my project report.

- Bhavik Doshi (RIT CS 2010)


Abstract

A framework to test schema matching algorithms

Bhavik Doshi

Supervising Professor: Dr. Rajendra K. Raj

Schema matching, the process of identifying semantically related objects, plays an important role in data integration architectures. It can be described as a process in which source schema elements are mapped to matching target schema elements. It plays a critical role in enterprise information integration and has been a popular data management research topic, particularly in building data warehouses and data marts. Due to the subjective nature of schema matching, automating the process is complex, though efforts have been made to make it semi-automatic. In addition, traditional techniques take advantage of only one aspect: syntax, semantics, or the data and its probability distribution. Because each approach exploits a single feature and is implemented independently of the others, it is difficult to increase the success rate.

The latest development in this field is the use of a holistic approach to schema matching, which is domain independent and works on the principle of integrating different match processes. It is therefore essential to test the feasibility of these approaches on real-world schemas and examine their behavior. Initial results show that as the data similarity and the number of instances vary, the results of each of these methods vary. To address these issues, this project proposes and develops a framework to test the viability of the traditional and holistic approaches in real-world scenarios. It assesses system and matching efficiency based on evaluation metrics and other schema parameters. Furthermore, this project uses design of experiments to test the methods and draw statistical conclusions.


Contents

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
      1.2.1 Instance based schema matching . . . . . . . . . . . . . . . 2
      1.2.2 Element Level Schema Matching . . . . . . . . . . . . . . . . 4
      1.2.3 Holistic Schema Matching . . . . . . . . . . . . . . . . . . 5
      1.2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . 7
      1.2.5 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . 9
      1.2.6 RoadMap . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . 11
   2.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
      2.1.1 Design considerations in Sinha et al. [11] . . . . . . . . . 11
      2.1.2 Design considerations in this project . . . . . . . . . . . . 13
   2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 14
      2.2.1 Data Generator . . . . . . . . . . . . . . . . . . . . . . . 15
      2.2.2 Matching Generator . . . . . . . . . . . . . . . . . . . . . 15
      2.2.3 True Matching Vector . . . . . . . . . . . . . . . . . . . . 16
      2.2.4 Evaluation Metrics Generator . . . . . . . . . . . . . . . . 16

3 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
   3.1 Testing design and Methodology . . . . . . . . . . . . . . . . . . 17

4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
   4.1 Analysis Methodology . . . . . . . . . . . . . . . . . . . . . . . 19
   4.2 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
   4.3 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
      4.3.1 Evaluating Precision versus Method, Field Similarity and Instances . 20
      4.3.2 Residual Plots for Precision . . . . . . . . . . . . . . . . 22
      4.3.3 Main Effects Plot for Precision . . . . . . . . . . . . . . . 22
      4.3.4 Interaction Plot for Precision . . . . . . . . . . . . . . . 24
   4.4 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
      4.4.1 Evaluating Recall versus method, field similarity and instances . 25
      4.4.2 Residual Plots for Recall . . . . . . . . . . . . . . . . . . 27
      4.4.3 Rerunning the analysis for log (Recall) . . . . . . . . . . . 28
      4.4.4 Main Effects Plot for log (Recall) . . . . . . . . . . . . . 29
      4.4.5 Interaction Plot for log (Recall) . . . . . . . . . . . . . . 31

5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
   5.1 Current Status . . . . . . . . . . . . . . . . . . . . . . . . . . 32
   5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
   5.3 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

A Code Listing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


List of Tables

4.1 General Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


List of Figures

1.1 Example used in [5] . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Matching generated for Source and Target Schemas from Dependency Graphs [11] . . . 4
1.3 Holistic Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Design consideration in [11] . . . . . . . . . . . . . . . . . . . . . 12
2.2 Design consideration in this project . . . . . . . . . . . . . . . . . 13

3.1 Testing design generated from Minitab . . . . . . . . . . . . . . . . 17

4.1 Analysis of Variance for Precision, using Adjusted Sum of Squares (SS) for Tests . . . 20
4.2 Residual Plots for Precision . . . . . . . . . . . . . . . . . . . . . 22
4.3 Main Effects Plot for Precision . . . . . . . . . . . . . . . . . . . 23
4.4 Interaction Plot for Precision . . . . . . . . . . . . . . . . . . . . 24
4.5 Analysis of Variance for Recall, using Adjusted Sum of Squares (SS) for Tests . . . 25
4.6 Residual Plots for Recall . . . . . . . . . . . . . . . . . . . . . . 27
4.7 Analysis of Variance for log(Recall), using Adjusted Sum of Squares (SS) for Tests . . . 28
4.8 Residual Plots for log(Recall) . . . . . . . . . . . . . . . . . . . . 29
4.9 Main Effects Plot for log(Recall) . . . . . . . . . . . . . . . . . . 30
4.10 Interaction Plot for log(Recall) . . . . . . . . . . . . . . . . . . 31


Chapter 1

Introduction

1.1 Introduction

Because of its complex nature, schema matching has become one of the most important steps in the data integration process. Schema matching also plays a decisive role in enterprise information integration, making it a prime topic in data management research [10]. Schema matching can be defined as follows: given source and target schemas, matching maps source elements to target elements [11]. Automating the schema matching process is difficult due to its subjective nature, but constant efforts are being made to make it semi-automatic. Traditional schema matching techniques have exploited individual features of schema structure, syntax, semantics, and the data and its probability distributions.

Sinha et al. [11] propose a holistic, domain-independent approach to schema matching, which generalizes the process by integrating different match processes. The holistic approach uses two methods: one from Kang et al. [5], an instance-based approach, and the other from Li et al. [6], an element-based approach. The instance-based approach is a two-step technique that uses the mutual information and probability distributions of the data to perform matching even in the presence of opaque column names and data values. The element-based approach, on the other hand, is a matching tool that automates schema matching between the source and target schemas based on structures, constraints, and other elements. The holistic approach combines the instance-level and element-level approaches into a generic, domain-independent technique for relational databases, using techniques from both methods together with the semantic and structural aspects of relational databases. Because real-world databases are often poorly built, schema matching is difficult: each approach generates different results depending on the relational design, data values, and database object names. Precision and recall were the metrics used, and Java programs were written to test the accuracy of the algorithms.


Initial tests on all of these approaches by Sinha et al. [11] illustrate that the holistic approach works well in some cases compared with the traditional approaches. However, these tests do not lead to concrete conclusions, so it becomes crucial to test the algorithms on multiple large datasets with varying parameters, such as field similarity and number of instances, in order to fully characterize the behavior of the algorithms under varying schematic conditions. To assist in testing, this project also proposes a framework that takes the dataset names and the actual matching and generates multiple results by varying dataset parameters. The framework semi-automates the process, as it calculates the evaluation metrics for a defined field similarity and an increasing number of instances. Four replications of a full factorial experiment were performed and analyzed using Analysis of Variance (ANOVA).

1.2 Background

Sinha et al. [11] implement three schema matching algorithms in their proposed work, and this project uses these approaches to test the quality of matching in real-world scenarios. To better understand the testing and analysis done in this project, we first need a brief overview of all three algorithms. The rest of this chapter gives an insight into the three approaches.

1.2.1 Instance based schema matching

This approach, proposed by Kang and Naughton [5], takes the probability distribution of the data in the relations into consideration and ignores the actual data values as well as the names of the attributes. They propose a two-step technique that works even in the presence of opaque column names and data values, thus assuming that the data is not in a standard format and has poor quality. This technique is a modification of the traditional instance-based approach, in which actual instances are compared to determine the matching; it works where the traditional approach fails, for two main reasons: first, it does not rely on any interpretation of data values, and second, it considers correlations among the columns in each table [5].

By treating column names and data values as opaque, the instance-based technique helps find correct matches even if the database is weakly named and has mismatched data types. For example, in a particular matching the source attribute might be numeric while the target attribute is of character type, yet they represent the same entity; a constraint-based matcher cannot find such a similarity. The first step in this algorithm is to measure the pairwise attribute correlations and construct a schema graph from the source and target schemas. In its simplest definition, a dependency graph is a graphical representation of a database: its vertices correspond to the relation and relationship tables of the schema, and its edges to the referential integrity constraints in the schema. The next step is to create dependency graphs from the source and target schemas.

Before creating the dependency graphs, one needs to understand the statistical calculations that constitute the vertex and edge weights of the dependency graphs. The first step in generating the graphs is to calculate the marginal probability for each attribute in the relational schema, followed by the joint probability and entropy. The next step is to calculate the pairwise measures over attributes, the conditional probability and the mutual information, which makes the order of computation O(n^2 * m), where n is the number of attributes and m is the number of rows in the given schema. The complex part of this algorithm is that these computations take several traversals of the data. For this purpose, Sinha et al. [11] describe a way to reduce the computation by storing the marginal and joint probabilities in a hashed data structure with constant lookup time; only one traversal of the data is then required, and all other measures are calculated from these counts. This project uses the instance approach developed in [11].
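The counting scheme described above can be sketched as follows. This is an illustrative Python sketch, not the project's implementation; the row format (tuples of column values) and the function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(rows, i, j):
    """Estimate I(X_i; X_j) for columns i and j from value frequencies.
    The marginal and joint counts are kept in hash tables (Counters),
    so each probability lookup is constant time."""
    n = len(rows)
    px = Counter(r[i] for r in rows)            # marginal counts, column i
    py = Counter(r[j] for r in rows)            # marginal counts, column j
    pxy = Counter((r[i], r[j]) for r in rows)   # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def pairwise_mutual_information(rows, num_cols):
    """Mutual information for every attribute pair; such scores serve
    as weights in the dependency graph."""
    return {(i, j): mutual_information(rows, i, j)
            for i, j in combinations(range(num_cols), 2)}
```

A perfectly correlated pair of columns yields a mutual information equal to the column's entropy, while independent columns yield zero.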

The next step in this algorithm is to construct dependency graphs for both the source and target schemas. We can explain this construction with the same example used in Kang et al. [5]. Consider two relations, one for the source schema and one for the target schema, illustrated in Figure 1.1. The rows and columns in both schemas are meaningless, as this algorithm ignores column names and data values.

Figure 1.1: Example used in [5]

As per the dependency graph construction defined in [5], the tables of the source and target schemas are converted into their corresponding dependency graphs. After generating the dependency graphs, the next step is to find the optimal matching between the source and target schemas. For this, Kang et al. [5] propose an evaluation metric that judges the quality of a matching; Sinha et al. [11] implement the four distance metrics defined in [5] to generate the best possible matching.

The instance-based approach calculates the score of each possible matching, selects the matching with the optimal score as the final matching, and presents it to the user. From these distance metrics it is clear that the injective mapping metrics are faster than the metrics used for partial matching, for which the order of computation increases.

Figure 1.2: Matching generated for Source and Target Schemas from Dependency Graphs [11]
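To make the search for the optimal matching concrete, the following sketch exhaustively scores every injective mapping of source graph nodes onto target graph nodes. The squared-difference edge score is an assumption standing in for the distance metrics of [5], which are not reproduced here; graphs are represented as dictionaries from node pairs to weights.

```python
from itertools import permutations

def matching_distance(src_edges, tgt_edges, mapping):
    """Sum of squared differences between the weight of each source edge
    and the weight of its image under the mapping (lower is better)."""
    return sum((w - tgt_edges.get((mapping[a], mapping[b]), 0.0)) ** 2
               for (a, b), w in src_edges.items())

def best_injective_matching(src_nodes, tgt_nodes, src_edges, tgt_edges):
    """Try every injective mapping of source nodes onto target nodes and
    keep the one with the smallest distance score."""
    best, best_score = None, float("inf")
    for image in permutations(tgt_nodes, len(src_nodes)):
        mapping = dict(zip(src_nodes, image))
        score = matching_distance(src_edges, tgt_edges, mapping)
        if score < best_score:
            best, best_score = mapping, score
    return best, best_score
```

Enumerating every mapping is factorial in the number of nodes, which illustrates why the scoring cost grows so quickly once partial matchings are also considered.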

1.2.2 Element Level Schema Matching

Sinha et al. [11] implement the element-based approach defined in [6], with certain modifications that make it more generic and applicable to a larger set of relational datasets. To find the matches, this algorithm primarily uses information at the relational level, along with some attribute-level information. Li et al. [6] define this algorithm with one major restriction: the target schema needs to be a data warehouse with a star schema. The main reason is that the first step of the approach is to identify the fact candidate corresponding to the fact table of the target schema. Sinha et al. [11] remove this restriction by selecting the relation with the maximum number of related relations as the pseudo fact table of the target schema.


The element-level schema matching algorithm implemented in [11] starts the match-finding process with an initial step called Fact Matching [6]. Fact matching requires the fact table of the target-schema data warehouse and the source schema; it is performed by the Fact Matcher object in the implementation of this approach, which calculates a score for each source table. At the end of this process, a fact candidate is found, which constitutes the root of the binary source schema tree. The fact candidate from the source schema corresponds to the fact table of the target schema, the prime table of the target schema.

After finding the fact candidate, the algorithm converts the source schema into a Binary Schema Tree [6], a customized data structure defined by Lingmei and Yang [6] in their proposed algorithm. In this data structure, the root of the tree corresponds to the relation determined by the fact matching process, and the tree uses a left-child right-sibling representation. To obtain better matching, the algorithm uses a binary source tree representation instead of the n-ary tree representation of the database schema. To do so, the algorithm first uses the relation identified by fact matching as the root and then converts the rest of the relations into a binary source tree.

The algorithm now generates a binary tree representation of the n-ary tree of the source schema. It first converts the source schema to a multi-way tree, using the construction in [6]; note that the parent-child relationships in this multi-way (n-ary) tree are the inverse of the relationships in the source schema. The algorithm then converts this tree into a first-child next-sibling representation, making the leftmost child the first child and all other children the right set of siblings. Finally, this tree is converted into a binary tree using the left-child right-sibling approach.

The last step is repeated until each node contains one of the following: a child; a child and a sibling; or nothing. After converting the source schema to a binary representation, the final step in this algorithm is dimension matching: every dimension of the target star schema is matched against the relations of the source schema, and a score for each match is calculated. The matching with the highest score then updates the binary tree by eliminating the parts of the source schema tree corresponding to the match.
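The left-child right-sibling conversion described above can be sketched as follows. The nested-tuple input format and the names are assumptions for illustration, not the data structure used in [6].

```python
class Node:
    """Node of a left-child right-sibling (LCRS) binary tree."""
    def __init__(self, name):
        self.name = name
        self.left = None   # first (leftmost) child
        self.right = None  # next sibling

def to_lcrs(name, children):
    """Convert an n-ary tree, given as nested (name, [children]) pairs,
    into its left-child right-sibling binary representation."""
    node = Node(name)
    prev = None
    for child_name, grandchildren in children:
        child = to_lcrs(child_name, grandchildren)
        if prev is None:
            node.left = child   # leftmost child becomes the first child
        else:
            prev.right = child  # remaining children chain as siblings
        prev = child
    return node
```

For a fact-table root with three dimension children, the first dimension hangs off `left` and the remaining dimensions chain through `right`, so every node ends up with at most a child and a sibling.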

1.2.3 Holistic Schema Matching

This algorithm was proposed in [11] to take every characteristic of the relational database, as well as the database objects, into consideration in the matching generation process. The algorithm takes a weighted average of all the approaches to generate the final matching, using the two approaches discussed above with modifications that help generate a better matching. The holistic approach can be defined as a matching process that combines an element-level matcher, an instance-level matcher, semantic matching, and constraint-based matching to generate the final matching. We discuss all of these techniques briefly later in this chapter.

The proposed holistic approach makes the matching process as generic as possible, so it can be used for any relational schemas without worrying about naming conventions, relational design, or domain knowledge. The holistic approach also comes with a graphical user interface, which makes it easy for a data analyst to view the matching process and change weights and parameters. This approach can also run the instance-level and element-level approaches individually and compare their matchings with the holistic one. Initial results show that the holistic approach does better than both the instance and element approaches on recall, and better than the instance approach on precision. We analyze and verify these results later in this project.

The holistic algorithm starts from the instance-based approach [5] but limits the number of connections in the dependency graph. The dependency graph for a schema is the complete graph whose edges carry the mutual information calculated between each pair of nodes. Mutual information is a constructive metric, as it represents all the information one node knows about the other; on the other hand, calculating it for all pairs would slow down the algorithm drastically. The holistic approach therefore draws an edge only if the two nodes belong to the same relation or share a referential integrity constraint. This reduces the computation considerably and, in addition, allows the structural properties and relationships of the schema to contribute to the final matching [11].

In the next step, the holistic approach revisits the element-based approach, removing the star schema restriction by setting the relation with the highest degree in the schema graph as the fact table. It also relaxes the strict matching of data types by assigning values ranging from 0 to 1 depending on the extent of the match. Thus the holistic approach uses both of the above approaches, with many other optimizations, to improve the extent and quality of schema matching. In addition, the holistic approach takes into consideration semantic similarity and constraint-based methods, which contribute to the matching in the form of a weighted average.
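The weighted-average combination at the heart of the holistic approach can be sketched as below. The matcher names, the constant example scores, and the candidate-pair representation are illustrative assumptions; in [11] the weights are adjustable by the analyst.

```python
def holistic_score(pair, matchers, weights):
    """Weighted average of the individual matcher scores (each in [0, 1])
    for one candidate attribute pair."""
    total = sum(weights[name] for name in matchers)
    return sum(weights[name] * matchers[name](pair) for name in matchers) / total

# Illustrative matchers: constants standing in for the real scorers.
matchers = {
    "instance":   lambda pair: 0.8,   # instance level (mutual information)
    "element":    lambda pair: 0.6,   # element level (structure, constraints)
    "semantic":   lambda pair: 1.0,   # thesaurus synonym check
    "constraint": lambda pair: 0.4,   # data-type and range comparison
}
weights = {"instance": 2, "element": 2, "semantic": 1, "constraint": 1}
```

With these example values, a candidate pair scores (2*0.8 + 2*0.6 + 1*1.0 + 1*0.4) / 6 = 0.7; raising a matcher's weight shifts the final matching toward that matcher's judgment.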

Page 17: A framework to test schema matching algorithmsbkd4833/Project.pdf · A framework to test schema matching algorithms by Bhavik Doshi A Project Report Submitted in Partial Fulfillment

7

Figure 1.3: Holistic Matching

The next method implemented in the holistic approach is semantic similarity between attribute names. The holistic approach uses an online thesaurus to determine whether two attributes are semantically similar by checking if their names are synonyms; the pair receives a score of one if they are, and zero otherwise.

The final technique used is constraint-based, and it scrutinizes the range of the data values present in the relations. If the data types of the two attributes differ, they are not compared further and the match pair receives a score of zero. If the data types are the same, the extent of the range match is calculated and a score between zero and one is given to that pair.
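The two scoring rules just described can be sketched as follows. The synonym table is a hypothetical stand-in for the online thesaurus lookup, and the range-overlap formula is an assumption about how the extent of the range match is computed.

```python
# Hypothetical synonym pairs standing in for the online thesaurus.
SYNONYMS = {("salary", "wage"), ("wage", "salary")}

def semantic_score(a, b):
    """1 if the attribute names are equal or synonyms, else 0."""
    return 1.0 if a == b or (a, b) in SYNONYMS else 0.0

def constraint_score(type_a, range_a, type_b, range_b):
    """0 for differing data types; otherwise the fraction of the combined
    value range that the two attributes' ranges share."""
    if type_a != type_b:
        return 0.0
    overlap = max(0.0, min(range_a[1], range_b[1]) - max(range_a[0], range_b[0]))
    span = max(range_a[1], range_b[1]) - min(range_a[0], range_b[0])
    return overlap / span if span > 0 else 1.0
```

For example, two integer attributes with ranges (0, 10) and (5, 15) share 5 of the 15 units of combined range, scoring 1/3, while an integer/character pair scores zero regardless of range.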

1.2.4 Related Work

Schema matching has become an integral part of data management research, and a considerable amount of work has been done in the area. Apart from the instance-based approach by Kang [5] and the element-based approach by Li [6], several different matching algorithms have been developed. Melnik et al. [8] propose a graph matching algorithm in which graphs are created from constraint and schema information; the major advantage of such an approach is that it can handle non-relational datasets, unlike most approaches. Rahm et al. [10] describe a complete taxonomy of schema matching approaches, under which the holistic approach is best described as a hybrid approach.

In their paper, Li and Clifton [7] describe the use of neural networks to find semantic similarity between elements; on the downside, this technique requires appropriate cleaning and training of the dataset model before use. The algorithm developed by Palopoli et al. [9] describes an interactive framework that creates a global integrated abstract schema. The specialty of such a framework is that it can extract mappings from multiple data sources, not just the source and target schemas; on the other hand, it requires setting up initial basic rules manually before the algorithm can process. Another element-based approach, besides Li [6], is by Bohannon et al. [1], which makes use of logical annotations for schema matching. They call this contextual schema matching, but the technique fails to handle weakly named schemas using logical annotations.

Another holistic approach is proposed by He and Chang [3], who develop a framework for matching query interfaces on the deep web. For the same purpose, they developed a different technique [4] that uses data mining methods such as clustering to identify matches between similar columns; in such a framework, the choice of distance metric determines the quality of the result when the data quality is poor. Drumm et al. [2] propose QuickMig, a technique that uses ontologies for schema matching. Using ontologies and a thesaurus can improve match quality to a considerable extent, but it is a tedious process that requires expertise and a lot of domain knowledge.

The holistic approach proposed by Sinha et al. [11] combines the instance-based and element-based approaches into a generic, integrated, domain-independent schema matching technique. Their framework uses relational datasets and tests its accuracy with the precision and recall metrics. The holistic approach fills the gaps of both the instance-based and element-based approaches and gives a combined, unified approach to schema matching. This project uses the architecture described in [11] and tests the schema matching algorithms on larger, real-world relational datasets.


1.2.5 Hypothesis

From the analysis made in [11], there is a need to determine the behavior of all three methods in real-world scenarios. When it comes to relational datasets, all three algorithms depend mainly on two decisive factors: the number of tuples in the relations and the field similarity between the schemas. Hence it is of utmost importance to vary these factors and then analyze the performance and accuracy of the three algorithms. The number of tuples (instances) is defined as the total number of tuples in every table of the source and target databases. Field similarity is calculated as the percentage of identically named attributes across all tables of the source and target databases. The datasets used in this project were generated with an external data generator.
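Field similarity as defined above can be computed roughly as follows. Treating the percentage as the fraction of source attribute names that reappear somewhere in the target is an assumption, since the report does not spell out the denominator.

```python
def field_similarity(source_schema, target_schema):
    """Percentage of source attribute names that also appear in the
    target. Each schema maps a table name to its list of attribute
    names."""
    src = [a for attrs in source_schema.values() for a in attrs]
    tgt = {a for attrs in target_schema.values() for a in attrs}
    if not src:
        return 0.0
    return 100.0 * sum(1 for a in src if a in tgt) / len(src)
```

For example, a source table with attributes `id` and `name` against a target table with `id` and `age` would give 50% field similarity under this definition.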

The main goal of this project is to analyze the quality of matching generated by the three approaches implemented in [11] by defining a testing framework and varying decisive, sensitive parameters. To support the hypotheses stated below, a testing framework was defined that calculates the decision-making parameters, precision and recall, for all three approaches while increasing the number of tuples at a fixed value of field similarity. The same tests were then repeated after manually changing the field similarity between the source and target schemas. Minitab was used to develop the testing design, and the results were analyzed using the ANOVA method.

To test the main hypothesis, we define three statistical hypotheses for each evaluation metric, precision and recall. Below are the three statistical hypotheses for precision; the same hypotheses apply to log(recall). The analysis chapter explains why the log of recall was used: it produced better results than raw recall and did not suffer from a non-constant variance issue. We test precision against three decisive factors: method, field similarity, and number of instances.

To test the factor method, we define the null hypothesis H0, which states that there is no difference between the three methods:

• H0 : µelemental = µinstance = µholistic

• H1 : There is a difference between the three methods, i.e., at least one method is different.

To test the factor field similarity, we define the null hypothesis H0, which states that there is no difference in the results when the field similarity is increased from 40% to 80%:

• H0 : µ(40% field similarity) = µ(80% field similarity)

• H1 : There is a difference when we increase the field similarity.

To test the factor number of instances, we define the null hypothesis H0, which states that there is no difference in the results when the number of instances is increased from 10 to 1000:

• H0 : µ(instances = 10) = µ(instances = 1000)

• H1 : There is a difference when we increase the number of instances.

In the next chapters we use the testing framework to test each of the above hypotheses for both metrics, precision and log(recall). Four replications of the tests were performed (each replication comprised 12 tests, 48 tests in total), and the final results were analyzed using various plots for precision and recall separately.
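To make the Method null hypothesis concrete, the following sketch computes a one-way ANOVA F-statistic by hand for hypothetical precision samples. The class name, method, and sample values are all invented for this illustration; the project's actual analysis was run in Minitab.

```java
import java.util.Arrays;

// Sketch: one-way ANOVA F-statistic for the "Method" factor, using
// hypothetical precision samples (not the project's real data).
public class AnovaSketch {
    static double fStatistic(double[][] groups) {
        int k = groups.length, n = 0;
        double grand = 0;
        for (double[] g : groups) { n += g.length; grand += Arrays.stream(g).sum(); }
        grand /= n;
        double ssBetween = 0, ssWithin = 0;   // between-group and within-group sums of squares
        for (double[] g : groups) {
            double mean = Arrays.stream(g).average().orElse(0);
            ssBetween += g.length * (mean - grand) * (mean - grand);
            for (double x : g) ssWithin += (x - mean) * (x - mean);
        }
        double msBetween = ssBetween / (k - 1);   // mean square between groups
        double msWithin  = ssWithin / (n - k);    // mean square within groups
        return msBetween / msWithin;
    }

    public static void main(String[] args) {
        double[][] precisionByMethod = {
            {0.90, 0.88, 0.92, 0.91},   // "elemental" (illustrative values)
            {0.55, 0.60, 0.58, 0.57},   // "instance"
            {0.85, 0.83, 0.87, 0.86}    // "holistic"
        };
        // A large F relative to the F-distribution's critical value at alpha = 0.05
        // would lead to rejecting H0 (all method means equal).
        System.out.printf("F = %.2f%n", fStatistic(precisionByMethod));
    }
}
```

In the real analysis, Minitab converts this F-statistic into the p-values discussed in Chapter 4 and compares them with alpha = 0.05.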

1.2.6 Roadmap

The rest of this report is organized as follows. Chapter 2 provides design and implementation information. Testing methodologies and results are discussed in Chapter 3. Future work and conclusions are described in Chapter 4.

Chapter 2

Design and Implementation

2.1 Design

This chapter is divided into two sub-sections. Section 2.1.1 presents the system design considerations taken into account by Sinha et al. [11] when implementing their proposed system. Section 2.1.2 discusses the design modifications and considerations made in this project for testing and analyzing the system with simulated datasets. I also propose an extension on top of the implemented system that automatically updates the number of instances for both the source and target schemas, for a fixed value of field similarity, as required.

2.1.1 Design considerations in Sinha et al. [11]

The holistic approach uses both the instance- and elemental-based techniques, with modifications, to generate a better matching than either approach alone. The design used by the holistic approach is shown in Figure 2.1. It consists of five modules that allow a database expert/analyst to use any of the three matching techniques. The remainder of this chapter looks at each of the modules in detail.

The first module is the database interface, whose main purpose is to deal with the database management system instance. The main responsibility of this layer is to store and retrieve the data or metadata required by any of the three layers below. This layer makes the design loosely coupled and adds a layer of abstraction, relieving the other modules from the complex task of collecting information from the data sources. This module may seem simple, but it is of significant importance to the functioning of the entire system. One of its important functions is to help construct the schema graph data structure by providing the information that identifies the referential integrity constraints constituting the edges of the graphs. To identify the referential integrity constraints, information has to be gathered from

Figure 2.1: Design consideration in [11]

the metadata of database objects.

For this project we used SQL Server 2005 as our database repository, so this information was collected from the sysobjects table. By separating out the task of retrieving information from the database repository, we also reduce the complexity of the system; if written as part of each individual module, this task would result in many data connections and convoluted, complex code. The interface hides the specific data collection details from the rest of the modules. Another important feature of this module is that it provides the flexibility to switch to another relational database management system, such as Oracle or MS Access, without modifying the other modules that implement the actual algorithms. This is because the other modules deal only with the database object provided by the database module and need not care about the interaction details.
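A minimal sketch of the schema graph data structure mentioned above, with tables as nodes and referential integrity (foreign key) constraints as directed edges. The table names and the exact representation are hypothetical, not taken from [11]; in the real system the edges come from the metadata collected by the database interface.

```java
import java.util.*;

// Sketch only: tables as nodes, foreign-key constraints as directed edges.
// In the real system this information is read from SQL Server metadata;
// here it is hard-coded for illustration.
public class SchemaGraphSketch {
    private final Map<String, List<String>> edges = new HashMap<>();

    // Record a referential integrity constraint from one table to another.
    public void addForeignKey(String fromTable, String toTable) {
        edges.computeIfAbsent(fromTable, t -> new ArrayList<>()).add(toTable);
    }

    // Tables that the given table references through foreign keys.
    public List<String> referencedBy(String table) {
        return edges.getOrDefault(table, Collections.emptyList());
    }

    public static void main(String[] args) {
        SchemaGraphSketch g = new SchemaGraphSketch();
        g.addForeignKey("Orders", "Customers");  // e.g. Orders.custId references Customers.id
        g.addForeignKey("Orders", "Products");   // e.g. Orders.prodId references Products.id
        System.out.println(g.referencedBy("Orders"));
    }
}
```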

The next level consists of the two algorithms, and their respective data structures, implemented in the proposed system. The two algorithms were presented by Kang [5] and Li [6] and include many changes that vary the complexity of the algorithms, the ranking metrics, the threshold values, and the order in which operations are performed. The third tier is the holistic approach, which uses the algorithms in the layer above with changes and modifications that improve the quality of matching; these were described in the section above. All three algorithms interact with the database tier to get data and metadata from the source and target databases. After the data collection process, each algorithm creates its required individual data structures, such as ranking structures, binary schema trees, and various kinds of dependency graphs and schema graphs.

Each approach generates a matching structure that contains the matching from the source schema to the target schema produced by the algorithm. The matching structure also contains the score of each match so that the data analyst can evaluate it. All of this is fronted by a graphical user interface that makes the matching task as simple as possible for the data analyst.

2.1.2 Design considerations in this project

As stated in the hypothesis, the aim of this project is to test the viability of all three approaches against simulated real-world datasets and to analyze the results while varying factors such as field similarity and number of instances. We also need to analyze how these factors affect the matching generation process for each algorithm. To achieve this, the project adds a testing tier to the architecture above, which helps semi-automate the tests performed on the three approaches. This is shown in Figure 2.2.

Figure 2.2: Design consideration in this project

The testing tier couples the database interface with the three matching techniques and interacts with each tier during the testing process. The testing framework is customized so that the data analyst only needs to provide the source and target database names from SQL Server, along with the correct matching as determined by the user. The framework automatically calculates the precision and recall for all three approaches and writes them to corresponding comma-separated value (CSV) files. Precision is the ratio of correct matches to the number of generated matches. Recall is the ratio of correct matches to the actual number of true matches.

The framework first gets the database objects for both the source and target schemas from the database interface. It then passes the objects to the three approaches (instance, elemental, and holistic) to obtain each corresponding matching instance, which contains the matching generated by the individual method. Once the matching instance is generated for a particular method, it is sent to the module that calculates precision and recall: the matching instance is passed to the precision and recall modules, which check the generated matching against the user-specified matching and calculate the precision and recall values for one case of field similarity and number of instances. The framework then automatically increases the number of instances by the value specified by the user and re-runs the above process, recalculating the matching instance, precision, and recall for each approach. The framework thus semi-automates the testing process by varying one parameter, the number of instances; field similarity, the second parameter, is modified manually in this project.
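The sweep described above can be sketched as a nested loop. Everything here (class name, method, and the placeholder CSV row) is illustrative and not the project's actual code, which runs the real matchers and computes precision and recall inside the loop.

```java
import java.util.*;

// Illustrative sketch of the semi-automated sweep: for a manually fixed
// field similarity, the framework steps the instance count, re-runs each
// approach, and collects one CSV row per method per setting.
public class TestSweepSketch {
    public static List<String> sweep(int fieldSimilarityPercent, int[] instanceCounts,
                                     String[] methods) {
        List<String> csvRows = new ArrayList<>();
        for (int instances : instanceCounts) {       // automated dimension
            for (String method : methods) {          // run all three approaches
                // Placeholder row: the real cycle computes precision/recall here.
                csvRows.add(method + "," + fieldSimilarityPercent + "," + instances);
            }
        }
        return csvRows;
    }

    public static void main(String[] args) {
        // One sweep at 40% field similarity; repeated manually at 80%.
        List<String> rows = sweep(40, new int[]{10, 1000},
                new String[]{"elemental", "instance", "holistic"});
        rows.forEach(System.out::println);  // 6 rows: 3 methods x 2 instance counts
    }
}
```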

Adding this extra tier to the architecture not only automates the testing process but also keeps it loosely coupled with the other modules. It relieves the data analyst from individually testing each approach and computing precision and recall by manually comparing the generated matching with the correct matching. It also gives the analyst the flexibility of comparing the evaluation metrics, as they are stored in CSV files for each of the approaches.

2.2 Implementation

The work by Sinha et al. [11] implements the instance, elemental, and holistic approaches in a loosely coupled way so that a data analyst can run any algorithm at any time. This project extends that implementation with a new testing tier, which helps in understanding and analyzing the behavior of the algorithms by varying the decisive parameters that affect the generated matching. The code for the testing interface is written in Java SE 6 and the source and target databases are stored in SQL Server 2005. The connection between the database repository and the testing layer is made by the database interface using a JDBC-ODBC bridge connection.

The next sub-sections walk through one cycle of a test and demonstrate in detail how the testing framework interacts with the other layers and generates a set of results for a particular field similarity and number of instances. As stated earlier, the framework semi-automates the process: in the next cycle it increases the number of instances for the same value of field similarity. The field similarity parameter is

varied manually by the data analyst. We divide the testing framework into four modules: the data generator, the true matching data structure, the matching generator, and the evaluation metrics generator. All of these modules are described in detail in the remainder of the chapter.

2.2.1 Data Generator

The first step in this cycle is to get the information and metadata required to run the three algorithms. As stated earlier, this is done by the database tier. The testing tier triggers the database tier by providing it with the names of the source and target schemas, and the database tier returns the corresponding database objects. The testing tier then uses these database objects to add instances to the source and target data repositories. This task is accomplished by the following methods:

1. insertintoDB1()

2. insertintoDB2()

These methods are called after every cycle, and the source and target databases are updated with the desired number of instances. In the experimental design used in this project, we run the algorithms for a fixed field similarity, say 40%, once with 10 instances and once with 1000 instances. We then change the field similarity to 80% and repeat the same process.
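A sketch of what such a data-generation step might look like, producing N synthetic rows as INSERT statements. The table, column names, and row format are hypothetical; the actual insertintoDB1()/insertintoDB2() methods load the data into SQL Server via the database interface.

```java
import java.util.*;

// Hypothetical sketch of the data-generator step: build N synthetic rows
// as INSERT statements for one table. The real project's methods execute
// the inserts against SQL Server 2005 rather than returning strings.
public class DataGeneratorSketch {
    static List<String> generateInserts(String table, int n) {
        List<String> stmts = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            stmts.add(String.format(
                "INSERT INTO %s (id, name) VALUES (%d, 'name_%d')", table, i, i));
        }
        return stmts;
    }

    public static void main(String[] args) {
        // e.g. grow the source table from 10 to 1000 instances between cycles
        System.out.println(generateInserts("SourceCustomer", 2));
    }
}
```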

2.2.2 Matching Generator

After generating the source and target database objects and loading the databases, the next step is to calculate the matching structure for each of the three algorithms. This process is divided into two steps:

1. Pass the source and target database objects to the concerned algorithm tier.

2. Calculate the matching structure for the above algorithm.

In the matching generator module, each algorithm has its own method that takes in the source and target database objects and generates its matching structure. For example, the elemental-based method takes in the database objects and calls the elemental-based matching tier, which generates the matching structure; the method then returns the matching structure to the evaluation metrics generator module. This process is repeated for the other two methods, which completes one testing cycle for one case of field similarity and number of instances.

2.2.3 True Matching Vector

In the testing framework, the task of this data structure is to store the correct matching between the source and target schemas as determined by the data analyst. The evaluation metrics generator uses this data structure to compute the metrics, precision and recall, by comparing the generated matching with the true matching vector. Precision is the ratio of correct matches to the number of generated matches. Recall is the ratio of correct matches to the actual number of true matches.

2.2.4 Evaluation Metrics Generator

The task of this module is to generate precision and recall for the algorithm to which a matching corresponds. It takes in the matching instance of each of the three approaches and generates the evaluation metrics, precision and recall, using the true matching vector to obtain the total number of true matches in the two input schemas. Precision and recall are measures of answer quality and are widely used in the text retrieval community [5]. Let a be the total number of matches produced by any of the three schema matching algorithms; b be the total number of true matches in the two input schemas as determined by a data analyst; and c be the number of correct matches produced by the schema matching algorithm. Precision and recall are then defined as follows:

• Precision = c/a

• Recall = c/b
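A small worked example of these two formulas on a hypothetical one-to-many matching (all attribute names are invented): with a = 3 generated matches, b = 4 true matches, and c = 2 correct matches, precision is 2/3 while recall is 1/2.

```java
import java.util.*;

// Sketch: compute precision (c/a) and recall (c/b) for a hypothetical
// one-to-many matching, where the two metrics differ.
public class PrecisionRecallSketch {
    public static double precision(Set<String> generated, Set<String> truth) {
        long c = generated.stream().filter(truth::contains).count();
        return (double) c / generated.size();           // c / a
    }

    public static double recall(Set<String> generated, Set<String> truth) {
        long c = generated.stream().filter(truth::contains).count();
        return (double) c / truth.size();               // c / b
    }

    public static void main(String[] args) {
        // b = 4 true matches; s.dob maps to two target columns (one-to-many)
        Set<String> truth = new HashSet<>(Arrays.asList(
            "s.name->t.fullname", "s.id->t.key", "s.dob->t.birth", "s.dob->t.dob"));
        // a = 3 generated matches, of which c = 2 are correct
        Set<String> generated = new HashSet<>(Arrays.asList(
            "s.name->t.fullname", "s.id->t.key", "s.city->t.state"));
        System.out.println("precision = " + precision(generated, truth)); // 2/3
        System.out.println("recall    = " + recall(generated, truth));    // 1/2
    }
}
```

Note how the one-to-many match on s.dob makes b larger than a, pulling recall below precision, which is exactly why one-to-many schemas keep the two metrics distinct.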

Precision and recall are identical if we consider only one-to-one or onto mappings [5], since the number of produced matches and true matches are then always the same due to cardinality constraints. In this project we have considered databases with one-to-many mappings, both to keep precision and recall significantly different and to test the performance of the algorithms on real-world one-to-many mapping schemas. Lastly, this module has two sections: one calculates and returns the value of precision when a matching structure is given as input, and the other calculates and returns the value of recall. In this project, we calculated precision and recall under 48 different conditions by varying the decisive parameters, field similarity and number of instances. The testing design is explained in detail in the next chapter.

Chapter 3

Testing

3.1 Testing Design and Methodology

The main hypothesis of this project was that the quality of matching generated by the three approaches implemented in [11] can be analyzed by defining a testing framework and varying the decisive, sensitive parameters. Testing this hypothesis meant taking into consideration the decisive factors of all three approaches; moreover, matching quality needed to be compared so as to draw conclusive results.

To analyze the results generated from the above implementation, I used Minitab v16 Statistical Software to generate and analyze the factorial design and to construct factorial plots. The generated design is shown in Figure 3.1.

Figure 3.1: Testing design generated from Minitab

The diagram above displays one replication of results. The design used was a general full factorial design with three factors.

The three columns Method, Field Similarity, and Instances represent the three factors. Each factor has its own levels, and the level types were converted to text so that the numeric level values would not affect the output and our analyses. The Method factor has three levels, Elemental, Instance, and Holistic, representing the individual methods used in this project.

The next factor is field similarity. We chose this factor in order to exercise the elemental approach, which calculates the matching based on the syntax of column names. Field similarity has two levels, 40% and 80%: 40% field similarity signifies that the attribute names between the source and target schemas are 40% similar, and 80% signifies that they are 80% similar.

The third factor is the number of instances. We chose this factor in order to exercise the instance-based approach, which calculates the matching based on the data and probability distributions of the databases. The levels 10 and 1000 signify that there are 10 and 1000 instances, respectively, in both the source and target schemas. We increased the number of instances because initial results of instance-based matching showed that the matching quality improves as the number of instances grows. The holistic method uses both the instance- and elemental-based methods, with improvements to each, and hence depends on both field similarity and number of instances.

The two dataset columns identify the source and target schemas used in the current experiment. Lastly, the precision and recall columns hold the precision and recall generated under the current experimental settings. The diagram above represents one of the four replications performed in this project.

As stated earlier, this project used SQL Server 2005 as the data repository, and the testing scripts were written in Java SE 6.

Chapter 4

Analysis

4.1 Analysis Methodology

The approach to testing, analyzing, and verifying results within this project is focused on executing the testing scripts against a number of source and target database schemas that represent a variety of scenarios. Testing has been segmented into two major sections, lining up with the requirements of the hypothesis: evaluating precision and evaluating recall. The precision and recall calculated in the tests above are measured against the decisive factors method, field similarity, and number of instances. A factorial design with a general linear model and the design of experiments (DOE) method were used for the factor comparisons. With these comparisons we were able to determine the significance and interactions of the factors in the matching generation process. As stated in the implementation and testing chapters, metrics gathering, reporting, and analysis tools have been set up to allow easy processing of results. Table 4.1 lists the factors, types, levels, and values used in generating the results. The remainder of the chapter covers both the test results and their analysis.

Table 4.1: General Linear Model

Factor     Type   Levels  Values
Method     fixed  3       Elemental, Instance, Holistic
FieldSim   fixed  2       40, 80
Instances  fixed  2       10, 1000

4.2 Environment

The project implementation, testing, and analysis were performed on a Sony VAIO VGN-CS190 with an Intel Core 2 Duo processor and 3 GB of RAM. The software products

required for this project are Java SE 6, SQL Server 2005, and Minitab 16 Statistical Software.

4.3 Precision

4.3.1 Evaluating Precision versus Method, Field Similarity and Instances

As stated above, precision is the ratio of correct matches to the number of generated matches. Figure 4.1 shows the analysis of variance for the response precision, with terms being method, field similarity, number of instances, and all their possible combinations, as listed in the Source column. Each individual term, and its interactions with the others, is tested to determine its significance in the overall matching generation process. The interactions are defined as follows.

Let the three factors be denoted by the letters A, B, and C. Model terms can be represented by the following general rules:

• A = main effect of factor A

• AB = two-way interaction between factors A and B

• ABC = three-way interaction between factors A, B, and C

Figure 4.1: Analysis of Variance for Precision, using Adjusted Sum of Squares (SS) for Tests

The model terms in this project are method, field similarity, and number of instances. These terms and their interactions are listed in the Source column of Figure 4.1. To determine which terms and interactions are significant, we set α (alpha) to 0.05, which corresponds to a 95% confidence level. We compare each p-value in the table with alpha, and only terms with p-values less than α are considered significant. On this basis we conclude that only method and field similarity are significant in determining precision, and the number of instances is not. The interactions are also not significant for precision.

Returning to the statistical hypothesis stated for precision at the start, we can now reject the null hypothesis that there is no difference between the three methods: at least one method behaves differently.

4.3.2 Residual Plots for Precision

We generate a four-in-one residual plot for precision, consisting of a histogram, a normal probability plot, a residuals-versus-fits plot, and a residuals-versus-order plot, depicted in Figure 4.2. In the normal probability plot, the points lie more or less on the line, which indicates that the data are normally distributed; the histogram is close to bell-shaped, which supports the same conclusion. The residuals-versus-fits plot does not show a trumpet shape, which indicates that the variances are equal and there is no non-constant variance issue with precision. Finally, the residuals-versus-order plot does not show any specific pattern, which indicates that the data are independent. All three results (normally distributed data, equal variances, and independent data) match the assumptions of the DOE method used in the analysis.

Figure 4.2: Residual Plots for Precision

4.3.3 Main Effects Plot for Precision

Let us now analyze the main effects on precision of the three decisive factors: method, field similarity, and number of instances. For this we use the main effects factorial plot, with the response variable being the overall average precision and the factors being the decisive parameters. Figure 4.3 shows the main effects plot for precision.

In each of the graphs, the x-axis represents method, field similarity, and number of instances respectively, while the y-axis represents the mean precision value calculated from the tests performed in this project.

The plot shows that, for precision, the elemental method is best, followed by the holistic method, with the instance method doing worst. This matches the initial results of Sinha et al. [11]. Field similarity behaves as expected: the line has a pronounced slope from 40% to 80%, which shows that it is significant in the matching generation process for the three methods. Returning to the statistical hypothesis for precision, we can therefore reject the null hypothesis that there is no difference in the results when the field similarity is increased from 40% to 80%.

On the other hand, the slope of the line for the number of instances is small, which suggests that the number of instances is not that important here. This is likely because we increased the instances only from 10 to 1000, a small comparison window; if the window were widened, say from 10 to 10,000, we might see a larger slope and the number of instances might become significant. Hence, for precision, we fail to reject the null hypothesis that there is no difference in the results when the number of instances is increased.

Figure 4.3: Main Effects Plot for Precision

4.3.4 Interaction Plot for Precision

Lastly, we analyze the interaction plots of precision for the three decisive factors: method, field similarity, and number of instances. For this we use the interaction factorial plot, with the response variable being the overall average precision and the factors being the decisive parameters. Figure 4.4 shows the interaction plot for precision; the y-axis represents the mean precision value calculated from the tests performed in this project. We can see that the elemental and holistic methods track each other very closely, but the instance method does not. The analysis-of-variance table for precision showed that the interactions were not significant, as their p-values were not less than α. This is confirmed by Figure 4.4: there is no visible interaction between the decisive parameters (method, field similarity, and number of instances), as none of the lines cross each other.

Figure 4.4: Interaction Plot for Precision

4.4 Recall

4.4.1 Evaluating Recall versus Method, Field Similarity and Instances

Recall is the ratio of correct matches to the actual number of true matches. Figure 4.5 shows the analysis of variance for the response recall, with terms being method, field similarity, number of instances, and all their possible combinations, as listed in the Source column. Each individual term, and its interactions with the others, is tested to determine its significance in the overall matching generation process. The interactions are defined as follows.

Let the three factors be denoted by the letters A, B, and C. Model terms can be represented by the following general rules:

• A = main effect of factor A

• AB = two-way interaction between factors A and B

• ABC = three-way interaction between factors A, B, and C

Figure 4.5: Analysis of Variance for Recall, using Adjusted Sum of Squares (SS) for Tests

The model terms in this project are method, field similarity, and number of instances. These terms and their interactions are listed in the Source column of Figure 4.5. To determine which terms and interactions are significant, we set α (alpha) to 0.05, which corresponds to a 95% confidence level. We compare each p-value in the table with alpha, and only terms with p-values less than α are considered

to be significant. On this basis we conclude that only method and field similarity are significant in determining recall, and the number of instances is not. The interactions are also not significant for recall.

4.4.2 Residual Plots for Recall

We generate a four-in-one residual plot for recall, consisting of a histogram, a normal probability plot, a residuals-versus-fits plot, and a residuals-versus-order plot, depicted in Figure 4.6. In the normal probability plot, the points lie more or less on the line, which indicates that the data are normally distributed; the histogram is close to bell-shaped, which supports the same conclusion. The residuals-versus-fits plot for recall, however, shows a horizontal trumpet shape, signifying a non-constant variance issue with recall. This does not match the assumptions of the DOE method used in the analysis. To correct this, I re-ran the analysis using the natural log of recall and regenerated the plots; the re-run analysis is described in the next section. Finally, the residuals-versus-order plot does not show any specific pattern, which indicates that the data are independent. Only two of the three results (normally distributed data and independent data) match the assumptions of the DOE method used in the analysis.

Figure 4.6: Residual Plots for Recall

4.4.3 Rerunning the analysis for log (Recall)

When we generated the four-in-one residual plot for recall, we found a trumpet shape in the residuals-versus-fits plot, signifying a non-constant variance issue with recall. To fix this, we take the natural log of recall and re-run the analysis. Figure 4.7 shows the analysis of variance for the response variable log(recall). Examining the p-value for the number of instances, we see that it drops below α, meaning that the number of instances becomes significant when the analysis is run with log(recall) as the response variable.

Figure 4.7: Analysis of Variance for log(Recall), using Adjusted Sum of Squares (SS) for Tests

Thus, going back to the statistical hypothesis we made for log(recall) at the start, we can now reject the null hypothesis that there is no difference between the three methods: each method has a statistically significant effect on recall.
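The mechanics of this test can be illustrated with a simplified sketch. The project ran the full factorial ANOVA in Minitab; as a hypothetical stand-in, the Python code below computes a one-way ANOVA F statistic for the method factor alone on log-transformed recall, using made-up recall values.

```python
import math
from statistics import mean

# Hypothetical recall values per method (illustrative only).
groups = {
    "elemental": [0.55, 0.60, 0.58, 0.62],
    "holistic":  [0.70, 0.74, 0.68, 0.72],
    "instance":  [0.40, 0.38, 0.45, 0.42],
}

# Natural-log transform of the response, as in section 4.4.3.
logged = {g: [math.log(v) for v in vals] for g, vals in groups.items()}

all_vals = [v for vals in logged.values() for v in vals]
grand = mean(all_vals)
k = len(logged)        # number of method levels
n = len(all_vals)      # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(len(vals) * (mean(vals) - grand) ** 2 for vals in logged.values())
ss_within = sum((v - mean(vals)) ** 2 for vals in logged.values() for v in vals)

# F = mean square between / mean square within; compare against F(k-1, n-k).
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_stat > 1.0)  # → True
```

A large F relative to the F(k-1, n-k) critical value is what drives the small p-values reported in figure 4.7.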

Figure 4.8 shows the residual plots for the response variable log(recall). The histogram is closer to a bell shape than the one obtained with recall as the response variable, and the residuals-versus-fits plot no longer resembles a trumpet, indicating that the non-constant variance issue has been resolved for log(recall).


Figure 4.8: Residual Plots for log(Recall)

4.4.4 Main Effects Plot for log (Recall)

Let us now analyze the main effects of the three decisive factors, method, field similarity, and number of instances, on log(recall). For this we choose the main effects factorial plot, with the response variable being the overall average log(recall) and the decisive parameters being the factors included in the plots. Figure 4.9 shows the main effects plot for log(recall). In each of the graphs the x-axis represents method, field similarity, and number of instances respectively, while the y-axis represents the mean log(recall) value calculated from the tests performed in this project.

It shows that when it comes to log(recall), the holistic method does slightly better than the elemental method, and the instance method does the worst. This result matches the initial results reported in Sinha et al. [11]. Field similarity behaves as expected: the line has a good slope from 40% to 80%, which shows that it is significant in the matching generation process for all three methods. Thus, going back to the statistical hypothesis we made for log(recall), we reject the null hypothesis that there is no difference in the results when we increase the field similarity from 40% to 80%.

On the other hand, the slope of the line for number of instances is small, which suggests that the number of instances does not seem to be all that important. This is because we have implemented a smaller comparison window for the number of instances; increasing the comparison window improves the chances of observing a good slope. Nevertheless, based on the analysis of variance, we reject the null hypothesis for log(recall) that there is no difference in the results when we increase the number of instances.

Figure 4.9: Main Effects Plot for log(Recall)
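A main effects plot is simply the mean response at each level of one factor. The Python sketch below (the project used Minitab's factorial plots; the run data here are hypothetical) computes the level means of log(recall) for the field-similarity factor.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical runs: (method, field_similarity_%, num_instances, recall).
runs = [
    ("holistic", 40, 10, 0.60), ("holistic", 80, 10, 0.74),
    ("elemental", 40, 10, 0.58), ("elemental", 80, 10, 0.70),
    ("instance", 40, 10, 0.40), ("instance", 80, 10, 0.47),
]

def main_effect(factor_index):
    """Mean log(recall) per level of one factor (the y-axis of figure 4.9)."""
    levels = defaultdict(list)
    for run in runs:
        levels[run[factor_index]].append(math.log(run[3]))
    return {level: mean(vals) for level, vals in levels.items()}

effects = main_effect(1)  # factor index 1 = field similarity, levels 40% and 80%
# A large gap between the level means is the "good slope" seen in the plot.
print(effects[80] > effects[40])  # → True
```

Calling `main_effect(0)` or `main_effect(2)` would give the corresponding panels for method and number of instances.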


4.4.5 Interaction Plot for log (Recall)

Lastly, we analyze the interaction plots of the three decisive factors, method, field similarity, and number of instances, for log(recall). For this we choose the interaction factorial plot, with the response variable being the overall average log(recall) and the decisive parameters being the factors included in the plots. Figure 4.10 shows the interaction plot for log(recall); the y-axis represents the mean log(recall) value calculated from the tests performed in this project. We can see that the elemental and holistic methods track each other very closely, but the instance method does not. From the analysis of variance for log(recall) we can see that the interactions were not significant, as their p-values were not less than α. This is supported by figure 4.10, as the lines for the decisive parameters (method, field similarity, and number of instances) largely do not cross. The lines for the holistic and elemental methods do cross, suggesting an interaction between method and number of instances; but since the elemental method is independent of the number of instances, its recall values remain constant, so this crossing can be ignored.

Figure 4.10: Interaction Plot for log(Recall)
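An interaction corresponds to non-parallel lines in the plot, that is, a non-zero "difference of differences" between cell means. The sketch below, using hypothetical cell means, shows how the method × number-of-instances interaction contrast could be computed.

```python
# Hypothetical mean log(recall) per (method, number-of-instances) cell.
cell_means = {
    ("elemental", 5): -0.45, ("elemental", 15): -0.45,  # flat: elemental ignores instances
    ("holistic", 5): -0.40, ("holistic", 15): -0.33,
}

# Parallel lines in the interaction plot mean a zero "difference of
# differences"; a non-zero value indicates an interaction between the factors.
interaction = ((cell_means[("holistic", 15)] - cell_means[("holistic", 5)])
               - (cell_means[("elemental", 15)] - cell_means[("elemental", 5)]))
print(round(interaction, 2))  # → 0.07
```

Whether such a contrast is statistically meaningful is judged by its p-value in the ANOVA, not by the plot alone.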


Chapter 5

Conclusions


5.1 Current Status

With schema matching being a focus of current data management research because of its importance in data warehouses and data marts, it is essential to develop a generalized technique that helps in the matching process. Sinha et al. [11] implement a holistic approach that modifies the instance and elemental approaches so as to improve the quality of matching. In the experiments conducted in [11], each method generates a different matching of the source and target schemas. Hence each method had to be tested and its behavior analyzed using the evaluation metrics.

To achieve this, the project proposed an additional testing layer that semi-automates the testing process and helps generate results quickly and efficiently. The project shows that each method depends on matching factors such as field similarity and number of instances, and that the quality of matching changes as these factors vary. My work also provides a comparative analysis of all three algorithms and analyzes the results, which further supports the experimental findings of [11].

5.2 Future Work

Although the work in this project develops a testing framework and shows that each method depends on different parameters to generate the matching, there are several outstanding research avenues that could increase its scope.

This project concentrated on developing and adding a testing framework and performed four iterations of results. The task of the framework was to semi-automate the process and assist the data analyst in generating results easily and efficiently. This gives only an initial insight into the results, showing that the three algorithms behave differently in different real-world scenarios. The algorithms could be tested more rigorously against a large set of real-world datasets, increasing the number of test iterations performed.

On the other hand, the testing and analysis framework implemented in this project focused on just two factors, field similarity and number of instances. This limits the testing complexity, as we are testing against only the edges of a cube with three major parameters: method, field similarity, and number of instances. By increasing the number of parameters tested we can enlarge the testing space and cover a larger section of real-world datasets. One such parameter is the topology of the source and target schemas. This project does not consider topology and assumes that the source and target schemas share the same topology; testing the algorithms with the framework defined here on schemas with different topologies would widen the testing space.

Lastly, based on the results, we see that in our analysis of precision and recall the number of instances does not seem to be significant in the matching generation process. This is a surprising result, as the instance and holistic approaches depend on the number of instances. To analyze this aspect, more testing can be performed with an increased window size for the number of instances, which might generate better results.
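For reference, the precision and recall metrics used throughout this analysis can be computed directly from the set of attribute matches an algorithm proposes and the set of correct matches. The attribute pairs below are hypothetical.

```python
# Hypothetical matches: pairs of (source attribute, target attribute).
predicted = {("cust_name", "customer_name"), ("addr", "address"), ("zip", "phone")}
gold = {("cust_name", "customer_name"), ("addr", "address"), ("zip", "zipcode")}

true_positives = len(predicted & gold)       # proposed matches that are correct
precision = true_positives / len(predicted)  # fraction of proposals that are correct
recall = true_positives / len(gold)          # fraction of true matches found

print(round(precision, 2), round(recall, 2))  # → 0.67 0.67
```

Any factor that changes the proposed match set, such as the instance comparison window, changes these two numbers, which is exactly what the DOE analysis measures.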

5.3 Lessons Learned

The testing and analysis performed in this project have demonstrated that each of the algorithms depends on several factors when generating the matching. We can see from the main effects plots for both precision and recall that the holistic and elemental approaches go hand in hand in generating matches, while the instance approach does the worst. This is because we have implemented a smaller comparison window for the number of instances in this project; increasing the window may produce better results, which opens a possibility for new research.

This project has opened the door to thinking about schema matching approaches on real-world, active datasets. It was a very good experience exploring relatively new territory. Although substantial research has been done on schema matching algorithms, not much has been accomplished in testing them against real-world datasets and analyzing the factors on which the final matching depends. The work done in [11] has been the base on which I extended my project.


This project taught me the importance of developing a framework that automates the testing process and generates results with ease. The framework developed in this project tests all three approaches and semi-automates the process, thus assisting a data analyst in the testing process.

On the other hand, statistical analyses were performed taking into consideration a considerable amount of test results and the factors on which the algorithms depend. The results generated by this project matched the results in [11]. The test design and factorial plots from Minitab helped in analyzing and generating decisive results, which opened up new possibilities for research in the field of schema matching.


Bibliography

[1] Philip Bohannon, Eiman Elnahrawy, Wenfei Fan, and Michael Flaster. Putting context into schema matching. In VLDB ’06: Proc. of 32nd Intl. Conf. on Very Large Data Bases, pages 307–318, 2006.

[2] Christian Drumm, Matthias Schmitt, Hong-Hai Do, and Erhard Rahm. QuickMig: automatic schema matching for data migration projects. In CIKM ’07: Proc. of 16th ACM Conf. on Information and Knowledge Management, pages 107–116, 2007.

[3] Bin He and Kevin Chen-Chuan Chang. A holistic paradigm for large scale schema matching. SIGMOD Rec., 33(4):20–25, 2004.

[4] Bin He, Tao Tao, and Kevin Chen-Chuan Chang. Organizing structured web sources by query schemas: a clustering approach. In CIKM ’04: Proc. of 13th ACM Intl. Conf. on Information and Knowledge Management, pages 22–31, 2004.

[5] Jaewoo Kang and Jeffrey F. Naughton. On schema matching with opaque column names and data values. In SIGMOD ’03: Proc. of 2003 ACM SIGMOD Intl. Conf. on Management of Data, pages 205–216, New York, 2003.

[6] Lingmei Li and Lan Yang. Automatic schema matching for data warehousing. In WCICA 2004, 5th World Congress on Intelligent Control and Automation, volume 5, June 2004.

[7] Wen-Syan Li and Chris Clifton. Semantic integration in heterogeneous databases using neural networks. In VLDB ’94: Proc. of 20th Intl. Conf. on Very Large Data Bases, pages 1–12, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.

[8] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, pages 117–128, 2002.

[9] Luigi Palopoli, Domenico Sacca, Giorgio Terracina, and Domenico Ursino. A unified graph-based framework for deriving nominal interscheme properties, type conflicts and object cluster similarities. In COOPIS ’99: Proc. of 4th IECIS Intl. Conf. on Cooperative Information Systems, Washington, DC, 1999.

[10] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4), 2001.

[11] Priyanka Sinha, Rajendra K. Raj, and Carol J. Romanowski. A holistic approach to schema matching. In Data Mining and Data Engineering Symposium, CSIE 2009 World Congress on Comp. Sci. and Info. Eng., Los Angeles, March 2009.


Appendix A

Code Listing

A complete code listing is available on the attached CD.