NEHRU COLLEGE OF ENGINEERING AND RESEARCH CENTRE

RECORD MATCHING OVER QUERY RESULTS

TABLE OF CONTENTS

Table of Contents
List of Tables
List of Figures
Acknowledgements
Synopsis

Chapters

1 INTRODUCTION
2 LITERATURE SURVEY
3 SYSTEM ANALYSIS
4 TOOLS REQUIRED
5 SYSTEM DESIGN
6 SYSTEM TESTING
7 OBSERVATION AND ANALYSIS
CONCLUSION
REFERENCES

1. INTRODUCTION

Today, more and more databases that dynamically generate web pages in response to user queries are available on the web. These web databases compose the deep or hidden web, which is estimated to contain a much larger amount of high-quality, usually structured information and to have a faster growth rate than the static web. Most web databases are only accessible via a query interface through which users can submit queries. Once a query is received, the web server retrieves the corresponding results from the back-end database and returns them to the user. To build a system that helps users integrate and, more importantly, compare the query results returned from multiple web databases, a crucial task is to match the records from different sources that refer to the same real-world entity.

Benefits:

Record matching, which identifies the records that represent the same real-world entity, is an important step in data integration. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD (unsupervised duplicate detection), which, for a given query, can effectively identify duplicates among the query result records of multiple Web databases. We use two cooperating classifiers, a weighted component similarity summing (WCSS) classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario, where existing supervised methods do not apply.

This system was designed to overcome the disadvantages of the existing system. Most existing record matching methods are supervised, which requires the user to provide training data. These methods are not applicable to the Web database scenario, where the records to match are query results dynamically generated on the fly. Such records are query-dependent, and a pre-learned method using training examples from previous query results may fail on the results of a new query. Also, the existing system has no efficient storage for all these details, and it consumes more time.

Record Matching:

Input design is the process of converting user-originated inputs into a computer-usable format. The goal of input design is to make data entry logical and free from errors; errors in the input data are controlled by the input design. This application is developed in a user-friendly manner. The forms are designed so that, during processing, the cursor is placed at the position where data must be entered. For each data item, the user can select an appropriate input from a list of valid values. For the client's comfort, the project validates every field and displays error messages with appropriate suggestions. Help messages are also provided whenever the user moves to a new field, so that he or she can understand what is to be entered. Whenever the user enters invalid data, an error message is displayed, and the user can move to the next field only after entering correct data.

After removal of the same-source duplicates, the "presumed" non-duplicate records from the same source can be used as training examples, alleviating the burden on users of having to manually label training examples. Starting from the non-duplicate set, we use two cooperating classifiers.
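As a rough illustration of the idea above, the sketch below builds the two starting sets of record pairs: same-source pairs, presumed to be non-duplicates once each source has been deduplicated, and cross-source pairs, which may contain duplicates. The function name `build_pair_sets` and the toy records are hypothetical; the report does not give an implementation.

```python
from itertools import combinations, product


def build_pair_sets(sources):
    """Split record pairs into presumed non-duplicates and candidates.

    `sources` maps a source name to its list of records, assumed to be
    already free of same-source duplicates.
    """
    non_duplicates = []  # pairs taken from within one source
    candidates = []      # pairs taken across two different sources
    for records in sources.values():
        non_duplicates.extend(combinations(records, 2))
    for (_, recs_a), (_, recs_b) in combinations(sources.items(), 2):
        candidates.extend(product(recs_a, recs_b))
    return non_duplicates, candidates


# Hypothetical query results from two web databases.
sources = {
    "db_a": ["Data Mining, Han", "Databases, Ullman"],
    "db_b": ["Data Mining - J. Han", "Compilers, Aho", "Networks, Tanenbaum"],
}
non_duplicates, candidates = build_pair_sets(sources)
print(len(non_duplicates), "presumed non-duplicate pairs")
print(len(candidates), "cross-source candidate pairs")
```

The presumed non-duplicate pairs can then serve as negative training examples, while the cross-source pairs are the ones the classifiers have to decide on.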

2. LITERATURE SURVEY

General

According to Weifeng et al. (2010), most record matching methods are supervised, which requires the user to provide training data. Since query results are dynamically generated, such supervised methods are not applicable to the web database scenario. The record matching work most closely related to unsupervised duplicate detection (UDD) is Christen's method. Using a nearest-based approach, Christen first performs a comparison step to generate weight vectors for each pair of records and then selects weight vectors as training examples.

UDD Algorithm

The UDD algorithm is used for online duplicate detection. Two classifiers are implemented to address the duplicate detection problem: the weighted component similarity summing (WCSS) classifier and the support vector machine (SVM) classifier. A linear kernel, chosen for its speed, is used for the SVM. In this algorithm, WCSS plays an important role. It is used to identify some duplicate vectors at the start, when there are no positive examples. Since no duplicate vectors are available at that point, classifiers that need class information to train, such as decision trees, cannot be used. After the iteration begins, WCSS is used again, in cooperation with the SVM, to identify new duplicate vectors. WCSS rests on two intuitions, the duplicate intuition and the non-duplicate intuition: the similarity between two duplicate records should be close to one, and the similarity between two non-duplicate records should be close to zero. Experimental results show that UDD works well for the web database scenario, where existing supervised methods are not applicable.
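A minimal sketch of the WCSS idea: the similarity of a record pair is the weighted sum of its per-field similarities, and the pair is treated as a duplicate when that sum reaches a threshold. The weights, the threshold value 0.8, and the function names are illustrative assumptions; the report does not specify how they are set.

```python
def wcss_similarity(vector, weights):
    """Weighted sum of per-field similarities, each assumed to be in [0, 1]."""
    if len(vector) != len(weights):
        raise ValueError("vector and weights must have the same length")
    total = sum(weights)
    # Normalise so the weights sum to 1 and the result stays in [0, 1].
    return sum(w / total * v for w, v in zip(weights, vector))


def wcss_classify(vectors, weights, threshold=0.8):
    """Return the similarity vectors whose weighted sum reaches the threshold."""
    return [v for v in vectors if wcss_similarity(v, weights) >= threshold]


# Hypothetical field weights (for example title, author, publisher).
weights = [0.5, 0.3, 0.2]
vectors = [
    (1.0, 0.9, 1.0),  # nearly identical in every field
    (0.2, 0.1, 0.0),  # clearly different records
    (0.9, 0.4, 1.0),  # similar title, different author
]
duplicates = wcss_classify(vectors, weights)
print(duplicates)
```

In the iterative scheme described above, the vectors selected this way would become the first positive examples for training the SVM.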

Schema Matching

Bin and Kevin (2006) explain that schema matching is fundamental for supporting query mediation across deep web sources. Schema matching exploits co-occurrence information of attributes, and these attributes are grouped by data complex matching (DCM). The DCM framework consists of data preprocessing, matching discovery, and matching construction. Before the DCM framework is executed, the data must be prepared, because query schemas in web interfaces are not readily minable in their hypertext markup language (HTML) format. Two experiments evaluate the performance of the DCM framework. The first isolates and evaluates the effectiveness of the DCM framework. The second tests the framework with automatically extracted query interfaces.

Metaquerier

The goal of the MetaQuerier is to build a middleware system that helps users find and query web sources. The MetaQuerier system has two critical parts that use the result of matching query interfaces. The first part builds a unified interface for each domain, through which users can issue queries. The second part can be used to improve the quality of source selection. In the form assistant approach, a form assistant toolkit is developed instead of a complete MetaQuerier system. It helps users translate queries from one interface to other relevant interfaces: if a user fills in the query form of one source, the form assistant can suggest translated queries for another source of interest. To enable such query translation, the form assistant needs to find matching attributes between the two interfaces. The matching algorithm employs a domain thesaurus that specifies the correspondences of attributes in the domain. In this scenario, matching and translation errors can be tolerated, and their impact is reduced by providing best-effort query suggestions.

Bayesian Decision Models

Vassilos et al. (2002) note that classification is one of the primary tasks of data mining. The goal of classification is to correctly assign cases to one of a finite number of classes. Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It is based on a decision problem in which the relevant probability values are known. In this setting, record matching, or linking, is the process of identifying records in a data store that refer to the same real-world entity. There are two types of record matching. The first type is called exact, or deterministic, and is primarily used when there are unique identifiers for each record. The other type is called approximate. The two principal steps in the record matching process are searching and matching. In the searching step, potentially linkable pairs of records are searched for. The matching step decides whether a given pair is correctly matched or not. Given a value r computed for a pair of records, the theoretical decision rule is:

If r > upper, designate the pair as a link.
If lower ≤ r ≤ upper, designate the pair as a possible link and hold it for clerical review.
If r < lower, designate the pair as a non-link.

The upper and lower cutoff thresholds are determined by error bounds on false matches and false non-matches.
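The three-way rule translates directly into code. In this sketch the function name `designate` and the threshold values 0.3 and 0.8 are placeholders; in practice the cutoffs would be derived from the chosen error bounds.

```python
def designate(r, lower, upper):
    """Apply the three-way decision rule to a pair's value r."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if r > upper:
        return "link"
    if r < lower:
        return "non-link"
    # lower <= r <= upper: hold the pair for clerical review.
    return "possible link"


# Hypothetical thresholds and scores, for illustration only.
for score in (0.95, 0.60, 0.10):
    print(score, designate(score, lower=0.3, upper=0.8))
```

Widening the gap between the two thresholds sends more pairs to clerical review and fewer to automatic decisions.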

Similarity Functions

Mikhail (2003) proposed two learnable string similarity measures: learnable edit distance with affine gaps, and learnable vector-space similarity based on pairwise classification. These similarity functions can be trained using a corpus of labeled pairs of equivalent and non-equivalent strings. Record linkage algorithms fundamentally depend on string similarity functions for discriminating between equivalent and non-equivalent record fields, and on a method for combining the field similarities. One advanced record linkage system is MARLIN (Multiply Adaptive Record Linkage with INduction). This system learns to combine trainable string metrics in a two-layer framework for identifying duplicate database records.
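For orientation, the sketch below computes plain unit-cost edit distance and a normalised similarity derived from it. This is the fixed-cost baseline, not the learnable affine-gap variant described above, whose costs are trained from labeled pairs; the function names are illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit costs, using two rolling rows."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete a character of s
                            curr[j - 1] + 1,      # insert a character of t
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


print(edit_distance("kitten", "sitting"))
print(edit_similarity("J. Smith", "John Smith"))
```

A learnable variant would replace the fixed costs of 1 with parameters estimated from training pairs.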

Positive Example Based Learning

Hwanjo et al. (2004) documented that positive example based learning (PEBL) for web page classification eliminates the need for manually collecting negative training examples in preprocessing. PEBL achieves high accuracy without loss of efficiency in testing. There are two challenges in this approach. The first is to collect unbiased unlabeled data from a universal set. The second is to achieve high classification accuracy from positive and unlabeled data. The PEBL framework applies an algorithm called mapping convergence (MC), which uses SVM techniques to meet the second challenge. PEBL runs the MC algorithm in the training phase to construct an accurate SVM from positive and unlabeled data. Once the SVM is constructed, classification performance in the testing phase is the same as that of a typical SVM in terms of both accuracy and efficiency. By contrast, a one-class support vector machine (OSVM) distinguishes one class of data from the rest of the feature space given only a positive data set; it draws the class boundary of the positive data set in the feature space.

Mapping Convergence Algorithm

The main goal of MC is to achieve high classification accuracy. MC runs in two stages: the mapping stage and the convergence stage. In the mapping stage, the algorithm uses a weak classifier that draws an initial approximation of strong negative data. The convergence stage uses a second base classifier to make progressively better approximations of the negative data. In the mapping stage, MC identifies strong positive features from the positive and unlabeled data by checking the frequency of the features within the positive and unlabeled training data. For example, a feature that occurs in ninety percent of the positive data but in only ten percent of the unlabeled data is a strong positive feature. MC builds a list of positive features, that is, features that occur in the positive training data more often than in the unlabeled data. Using this list, every possibly positive data point can be filtered out of the unlabeled data set; what remains is only strongly negative data, called strong negatives. An OSVM, by comparison, draws its boundary around the positive data set in the feature space, and its boundary cannot be as accurate as that of MC.
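The mapping stage can be sketched as a frequency comparison followed by a filter. The code below is an illustrative reading of the description above, not the published MC implementation: documents are modelled as sets of features, and all function names and sample data are assumptions.

```python
from collections import Counter


def feature_frequencies(docs):
    """Fraction of documents in which each feature occurs."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return {feature: n / len(docs) for feature, n in counts.items()}


def positive_features(positive_docs, unlabeled_docs):
    """Features relatively more frequent in the positive set."""
    pos = feature_frequencies(positive_docs)
    unl = feature_frequencies(unlabeled_docs)
    return {f for f, p in pos.items() if p > unl.get(f, 0.0)}


def strong_negatives(unlabeled_docs, features):
    """Unlabeled documents that contain none of the positive features."""
    return [doc for doc in unlabeled_docs if not features & set(doc)]


# Hypothetical data: each document is a set of word features.
positive_docs = [{"faculty", "course"}, {"course", "exam"}]
unlabeled_docs = [{"course", "football"}, {"football", "score"}, {"recipe"}]

features = positive_features(positive_docs, unlabeled_docs)
negatives = strong_negatives(unlabeled_docs, features)
print(sorted(features))
print(negatives)
```

The strong negatives found this way would seed the convergence stage, where an SVM is retrained repeatedly to refine the negative class.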

Fast Duplication Detection

Cyju and Sundar (2011) presented an approach to duplicate detection, which identifies the records that represent the same real-world entity. To address the problem of record matching in the web database scenario, fast duplication detection (FDD) is introduced. FDD uses clustering to reduce the number of record comparisons. For a given query, FDD can effectively identify duplicates among the query result records of multiple web databases. FDD is an efficient approach to the duplicate detection problem in the web database scenario, where the records to match are query-dependent and change dynamically. The steps in duplicate detection are similarity calculation, k-means clustering, vector generation, dynamic weight allocation, support vector machine classification, and actual duplicate identification. In similarity calculation, a threshold value is set to separate duplicate vectors from non-duplicate vectors. The k-means step applies clustering before the initial potential duplicate and non-duplicate vectors are obtained; clustering is the unsupervised classification of patterns or data items into groups, or clusters. Vector generation produces the potential duplicate and non-duplicate vectors, which are then clustered more efficiently. The dynamic allocation algorithm assigns weights dynamically to the different fields in each record; this step relies on two intuitions, the duplicate intuition and the non-duplicate intuition. Finally, the SVM classifier is a useful technique for data classification.

Duplicate Elimination

According to Vijayaraja et al. (2011), there are two system methodologies: unsupervised duplicate elimination (UDE) and fuzzy ontological document clustering (FODC). UDE is based on adjusting the weights set for record fields. It employs a similarity function to find the similarity of each field and uses a similarity vector to represent a pair of records. There are two classifiers in UDE: the weighted component similarity summing (WCSS) classifier and the support vector machine (SVM) classifier. WCSS is used to identify duplicate vectors. The SVM classifier should be insensitive to the relative sizes of the positive and negative example sets. Database schema matching is an essential step in data integration; it is the process of finding mappings between attributes of two schemas that semantically correspond to each other. In UDE, a global schema for the specific type of records is predefined, and each query result is matched to the global schema.
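The similarity vector that represents a record pair can be sketched as follows, assuming both records have already been mapped to the same global schema. The field names and the token-overlap (Jaccard) similarity are illustrative choices; the report does not say which field similarity function UDE uses.

```python
def field_similarity(a, b):
    """Jaccard overlap of lower-cased word tokens; 1.0 if both are empty."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_vector(rec_a, rec_b, fields):
    """One similarity value per field of the shared (global) schema."""
    return tuple(
        field_similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields
    )


# Hypothetical global schema and records from two sources.
FIELDS = ("title", "author", "publisher")
r1 = {"title": "Data Mining Concepts", "author": "Jiawei Han",
      "publisher": "Morgan Kaufmann"}
r2 = {"title": "data mining concepts", "author": "Han",
      "publisher": "Elsevier"}

vector = similarity_vector(r1, r2, FIELDS)
print(vector)
```

Each component of the vector is one input to the weighted sum used by WCSS and one feature for the SVM.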

Fuzzy Ontological Document Clustering

The methodology for fuzzy ontological document clustering (FODC) includes the following steps. The first step, building a patent ontology, requires a knowledge-based editing tool called Protégé, which assists domain experts in defining an ontology schema through a graphical interface. The second step is natural language processing and terminology training, which is carried out on a set of patent documents. Next, in the terminology analyzer step, all of the sentence concepts are inferred. Then, in the knowledge extraction step, the concept probabilities are computed for each chunk; the chunks implying concepts as predicates are the first to enter the ontology. The final step in FODC is the patent similarity match method.

Record Linkage

Peter (2004) described record linkage, or data linkage, as an important enabling technology in the health sector. Linked data is a cost-effective resource that can help to improve research into health policies and uncover fraud within the health system. Significant advances originating from data mining and machine learning have been made in recent years in various areas of record linkage. Most of these new methods are not yet implemented in current record linkage systems, or are hidden within black-box commercial software. This makes it difficult for users to learn about new record linkage techniques and to compare existing linkage techniques with new ones. As most real-world data collections contain noisy, incomplete, and incorrectly formatted information, data cleaning and standardization are important pre-processing steps for successful record linkage, and are also needed before data can be loaded into data warehouses or used for further analysis or data mining.

Febrl GUI Structure and Functionality

Febrl (Freely Extensible Biomedical Record Linkage) is a freely available record linkage system implemented in Python, a free object-oriented programming language that is available on all major computing platforms and operating systems. Originally developed as a scripting language, Python is now used in a large number of applications, ranging from Internet search engines and web applications to the steering of computer graphics for Hollywood movies and large scientific simulation codes. Many organizations use Python, including Google and NASA, and due to its clear structure and syntax it is also used by various universities for undergraduate teaching in introductory programming courses. Python is an ideal platform for rapid prototype development, as it provides data structures such as sets, lists, and dictionaries (associative arrays) that allow efficient handling of very large data sets, and includes many modules offering a large variety of functionalities. For example, it has excellent built-in string handling capabilities, and its large number of extension modules facilitates tasks such as database access and graphical user interface (GUI) development. A general schematic outline of the record linkage process is given in Figure 1.

Figure 1. General Record Linkage Process.

Entity Resolution

Omar et al. (2009) observed that entity resolution (ER) is the process of identifying and merging records that represent the same real-world entity. Record matching is computationally expensive and application-specific. For example, in the customer information management solutions of one company, a combination of nickname algorithms, edit distance algorithms, fuzzy logic algorithms, and trainable engines is used to match customer records. Entity resolution here rests on four assumptions: pairwise decisions, no confidences, no relationships, and consistent labels. Pairwise decisions: the match and merge functions operate on two records at a time, and their operation depends only on the data in these records, not on the evidence in other records. No confidences: the match functions may compute numeric similarities, but in the end they make yes-or-no decisions as to whether two records match. No relationships: each record contains all the information that pertains to its entity. Consistent labels: the input data is assumed to have gone through a schema-level integration phase, where incoming data is mapped to a common set of well-defined labels.
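Under these assumptions, entity resolution reduces to repeatedly applying a yes-or-no `match` function and a `merge` function to pairs of records. The loop below is a simple sketch in that spirit, not the authors' algorithm; the record layout (attribute name mapped to a set of values) and the matching rule are hypothetical.

```python
def match(a, b):
    """Pairwise yes/no decision: records share a name or a phone value."""
    return any(a.get(k, set()) & b.get(k, set()) for k in ("name", "phone"))


def merge(a, b):
    """Combine two records by taking the union of values per attribute."""
    return {k: a.get(k, set()) | b.get(k, set()) for k in a.keys() | b.keys()}


def resolve(records):
    """Merge matching records until no two remaining records match."""
    resolved = []
    pending = list(records)
    while pending:
        record = pending.pop()
        for i, other in enumerate(resolved):
            if match(record, other):
                # Re-queue the merged record: it may now match others.
                pending.append(merge(record, other))
                del resolved[i]
                break
        else:
            resolved.append(record)
    return resolved


records = [
    {"name": {"J. Doe"}, "phone": {"235"}},
    {"name": {"John Doe"}, "phone": {"235"}},
    {"name": {"John Doe"}, "email": {"jd@example.org"}},
]
entities = resolve(records)
print(len(entities), entities)
```

The first and third records do not match each other directly, but both are absorbed once the second record links them, which is why merged records are re-queued for comparison.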

Input Data Initialization

In the first step, the user has to select whether she or he wishes to conduct a project for cleaning and standardization of a data set, deduplication of a data set, or linkage of two data sets. The data page of the Febrl graphical user interface (GUI) changes accordingly and shows either one or two data set selection areas. Several text-based data set types are currently supported, including the most commonly used comma-separated values (CSV) file format. SQL database access will be added in the near future.
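Reading a comma-separated data set needs nothing beyond Python's standard `csv` module. The sketch below is not Febrl's own loader; the column names and the in-memory sample are made up for illustration.

```python
import csv
import io

# Hypothetical sample data; a real project would open a file instead.
SAMPLE = """rec_id,given_name,surname,suburb
r1,john,smith,richmond
r2,jon,smith,richmond
"""


def load_records(handle):
    """Read CSV rows into a list of dicts keyed by the header row."""
    reader = csv.DictReader(handle)
    return [dict(row) for row in reader]


records = load_records(io.StringIO(SAMPLE))
print(len(records), "records with fields", list(records[0]))
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.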

Information Integration

Sharma (2009) discussed information integration, query by keywords, and ranking of results in the context of web queries. Although internet search has been around for a while and is used by the general populace as well as by technical users, querying the web is still in its infancy and is limited to specific domains and applications. In contrast, querying a structured database has been around for several decades, and query answering as well as query optimization have advanced to a significant stage. The challenge now is whether the work on querying a structured database can be redirected meaningfully towards querying the web. Unlike search, querying the internet requires intelligent integration of information from multiple web sources to construct meaningful answers. A number of new issues need to be addressed in order to solve the problem of information integration: how to pose a query, how to determine the sources for answering a query, how to deal with the lack of a schema, data extraction from web sources, web query optimization, integration or combination of data from multiple sources, and ranking of results.

Query Specification

The differences between search, metasearch, and information integration are described, as are the differences between query processing over a database and query processing over the internet. A general framework and an architecture that highlight the sub-problems involved in processing an arbitrary query over the internet are also introduced. This brings out a number of new problems, their complexity, and where they currently stand in terms of solutions. Integration, or combining data from multiple sources, is presented as the general-purpose problem, along with details of the sub-problems that need to be solved in order to accomplish true information integration.

3. SYSTEM ANALYSIS

Analysis involves a detailed study of the current system, leading to the specifications of a new system. It is a detailed study of the various operations performed by a system and of their relationships within and outside the system. During analysis, data are collected on the available files, decision points, and transactions; questionnaires are among the tools used for system analysis. The main points to be discussed in system analysis are the specification of what the new system is to accomplish, based on the user requirements, and the functional hierarchy, which shows the functions to be performed by the new system and their relationships with each other. The software engineering techniques applied (system study and analysis, system requirement specification, system design, system coding, system testing, and implementation) were taken from the book Fundamentals of Software Engineering.

Project Features:

This section gives an overview of the various tasks in the project and of how they map onto the project phases. In the system study and analysis phase, the existing system was compared with the proposed system by means of the analysis done in the course of the project. The feasibility study was also done in this phase. All the requirements of the project, both software and hardware, are specified in the system requirement specification phase. The system design phase of the software development life cycle is an indispensable part of the development of the project, since it lays down the entire model, or framework, of the software.

Requirement Analysis:

Systems analysis is the study of systems, that is, sets of interacting entities, including computer systems. This field is closely related to operations research. It is also an explicit formal inquiry carried out to help someone, referred to as the decision maker, identify a better course of action and make a better decision than he or she might otherwise have made. Occupations that use systems analysis include systems analyst, business analyst, manufacturing engineer, and enterprise architect. Systems analysis is the process of examining a business situation for the purpose of developing a system solution to a problem or devising improvements to such a situation. Before the development of any system can begin, a project proposal is prepared by the users of the potential system and/or by systems analysts and submitted to an appropriate managerial structure within the organization. The objective of the system analysis phase is therefore to establish the requirements for the system to be acquired, developed, and installed.

Relevant Analytics:

Relevant analytics capabilities are often interwoven into applications for sales, marketing, and customer service. Sales analytics let companies monitor and understand customer actions and preferences through sales forecasting and data quality management. Applications generally come with predictive analytics to improve customer segmentation and targeting, and with features for measuring the effectiveness of online, offline, and search marketing campaigns. Web analytics have evolved significantly from their starting point of merely tracking mouse clicks on websites. By evaluating customer buy signals, marketers can see which prospects are most likely to transact and also identify those who are bogged down in a sales process and need assistance.

4. TOOLS REQUIRED

.NET Framework Overview:

C# is intended for use in developing software components suitable for deployment in distributed environments. Source code portability is very important, as is programmer portability, especially for programmers who are already familiar with C and C++. Support for internationalization is very important. C# is intended to be suitable for writing applications for both hosted and embedded systems, ranging from the very large that use sophisticated operating systems down to the very small that have dedicated functions. Although C# applications are intended to be economical with regard to memory and processing power requirements, the language was not intended to compete directly on performance and size with C or assembly language.

The Microsoft .NET Framework is a software framework that can be installed on computers running Microsoft Windows operating systems. It includes a large library of coded solutions to common programming problems and a virtual machine that manages the execution of programs written specifically for the framework. The .NET Framework is a key Microsoft offering and is intended to be used by most new applications created for the Windows platform. The framework's Base Class Library provides a large range of features, including user interface, data and data access, database connectivity, cryptography, web application development, numeric algorithms, and network communications. Programmers combine the class library with their own code to produce applications.

Programs written for the .NET Framework execute in a software environment that manages the program's runtime requirements. This runtime environment, which is also part of the .NET Framework, is known as the Common Language Runtime (CLR). The CLR provides the appearance of an application virtual machine, so that programmers need not consider the capabilities of the specific CPU that will execute the program. The CLR also provides other important services such as security, memory management, and exception handling.

ASP.NET:

Microsoft introduced Active Server Pages (ASP) to create dynamic web pages by
using server-side scripts. ASP.NET is the .NET version of ASP. An ASP.NET page is
a standard HTML file that contains embedded server-side script, and ASP.NET
provides the various advantages of server-side scripting. ASP.NET enables
developers to access information from data sources, such as back-end databases and
text files, that are stored on a Web server or on a computer that is accessible to a
Web server. ASP.NET also enables the use of a set of programming code called
templates to create HTML documents. The advantage of using templates is that the
content retrieved from data sources, such as back-end databases and text files, can
be inserted dynamically into an HTML document before the HTML document is
displayed to users. For this reason, the information need not be changed manually
whenever the content retrieved from the data source changes.
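The template mechanism described above can be illustrated outside ASP.NET. The following is a minimal sketch in Python, not ASP.NET code; the template text, the function name render_book_page, and the sample record are assumptions made only for this illustration.

```python
from html import escape
from string import Template

# Illustrative sketch; the template and all names here are hypothetical.
# $title, $author and $year are placeholders that are filled in from a
# data source when the page is requested.
PAGE_TEMPLATE = Template(
    "<html><body><h1>$title</h1><p>$author ($year)</p></body></html>"
)


def render_book_page(record):
    """Insert one record from a data source into the HTML template."""
    # Escape every value so that stored data cannot inject markup.
    safe = {key: escape(str(value)) for key, value in record.items()}
    return PAGE_TEMPLATE.substitute(safe)


if __name__ == "__main__":
    # Stand-in for a row fetched from a back-end database or a text file.
    row = {"title": "Example Book", "author": "A. Author", "year": 2010}
    print(render_book_page(row))
```

When the row in the data source changes, the next request renders the new values; the HTML template itself is not edited.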

ASP.NET in .NET Framework:

ASP.NET, which is the .NET version of ASP, is built on the Microsoft .NET
Framework. Microsoft introduced the .NET Framework to help developers create
globally distributed software with internet functionality and interoperability. The
elements of an ASP.NET application include the web forms, the configuration files,
and the XML web service files. XML web services provide a mechanism for programs
to communicate over the internet. The ASP.NET runtime services include view,
session, and application state management, web security, and the caching mechanism
of ASP.NET applications. The ASP.NET architecture is shown in Figure.2.

Figure.2. ASP.NET Framework

ASP.NET Application Elements:
  Web forms: pages with .aspx extension and corresponding class files
  Configuration files with .config extension
  XML Web services files with .asmx extension

ASP.NET Run-time Services:
  State management: view state, session state, application state
  Web security
  Caching

Underlying layers: ASP.NET Page Framework, XML, .NET Framework Base Classes,
Common Language Runtime

Hardware Specifications:

The hardware specifications vary from time to time. Heavier applications place
higher demands on the hardware, and the higher the hardware capability, the greater
the convenience for the developer. The main memory, or RAM, is the working memory
and largely decides the speed of the system. Processing speed determines the speed
of execution of the program. The hard disk is used for storing data. A hard disk
drive (HDD) is a non-volatile storage device which stores digitally encoded data on
rapidly rotating platters with magnetic surfaces. Strictly speaking, "drive" refers
to a device distinct from its medium, such as a tape drive and its tape, or a floppy
disk drive and its floppy disk. HDDs record data by magnetizing ferromagnetic
material directionally to represent either a 0 or a 1 binary digit, and they read the
data back by detecting the magnetization of the material. A typical HDD design
consists of a spindle that holds one or more flat circular disks called platters, onto
which the data are recorded. The platters are spun at very high speeds. The hardware
specification is shown in Table.1.

Table.1. Hardware Specifications

Item               Specification
Main Memory        512 MB or above
Hard Disk          Min 10 GB free
Processor          Pentium 4
Processor Speed    1.5 GHz

Software Specifications:

The preferred operating system is Windows XP. Windows XP is preferred over
Windows Vista and other versions because XP is the most widely used operating
system. Microsoft Visual Studio 2005 is the development environment for the
application, also called the Integrated Development Environment (IDE); it is the
working environment for the language that we choose. SQL Server 2005 is the
database in which all the data related to this project is stored. The Microsoft .NET
Framework is a software framework that can be installed on computers running
Microsoft Windows operating systems. It includes a large library of coded solutions
to common programming problems and a virtual machine that manages the execution
of programs written specifically for the framework. The .NET Framework is a key
Microsoft offering and is intended to be used by most new applications created for
the Windows platform. The framework's Base Class Library provides a large range
of features. The software specification is shown in Table.2.

Table.2. Software Specifications

Item                  Specification
Operating System      All versions of Windows
Required Framework    .NET Framework 2.0
Front End             Microsoft Visual Studio 2005, VC#
Back End              SQL Server 2005

Services of ASP.NET:

The runtime services of ASP.NET interact with the .NET Framework base classes,
which in turn interact with the common language runtime to provide a robust
web-based development environment. An application includes web forms with
controls such as text boxes and list boxes, configuration files, and the application
logic of the web application. Configuration files store the configuration settings of
an ASP.NET application. The elements of an ASP.NET application also include web
services, which provide a mechanism for programs to communicate over the internet.
The ASP.NET runtime services include session and application state management,
web security, and the caching mechanism of ASP.NET applications.

Execution of an ASP.NET File:

To execute an ASP.NET file, the following steps are taken. A web browser sends a
request for an ASP.NET file to a web server by using a Uniform Resource Locator
(URL). The web server receives the request and retrieves the appropriate ASP.NET
file from the disk or memory. The web server forwards the ASP.NET file to the
ASP.NET script engine for processing. The ASP.NET script engine reads the file
from top to bottom and executes any server-side script it encounters. The processed
ASP.NET file is generated as an HTML document, and the ASP.NET script engine
sends the HTML page to the web server. The web server then sends the HTML page
to the client, and the browser interprets the output and displays it.
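The execution steps above can be sketched as a toy model. The Python code below is only an illustration under simplifying assumptions and is not how the real ASP.NET engine is implemented: the functions script_engine and web_server, the in-memory file table, and the restriction to simple variable names inside the embedded tags are all hypothetical.

```python
import re

# Illustrative sketch; all names are hypothetical.
# Matches embedded server-side expressions of the form <%= name %>.
SCRIPT_TAG = re.compile(r"<%=\s*(\w+)\s*%>")


def script_engine(page_source, variables):
    """Process the page from top to bottom, replacing each embedded
    expression with its value, and return plain HTML."""
    def evaluate(match):
        # Unknown names become an empty string in this sketch.
        return str(variables.get(match.group(1), ""))
    return SCRIPT_TAG.sub(evaluate, page_source)


def web_server(url, files, variables):
    """Retrieve the requested file and forward it to the script engine."""
    page_source = files[url]                       # retrieve the file for the URL
    return script_engine(page_source, variables)   # HTML returned to the client


if __name__ == "__main__":
    files = {"/hello.aspx": "<html><body>Hello, <%= user %>!</body></html>"}
    print(web_server("/hello.aspx", files, {"user": "guest"}))
```

The client only ever receives the generated HTML; the embedded script itself stays on the server.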

Features of ASP.NET:

In addition to hiding script commands, ASP.NET has some advanced features that
help develop robust web applications. Code written in ASP.NET is compiled and not
interpreted. This makes ASP.NET applications faster to execute than other
server-side scripts that are interpreted, such as scripts written in a previous
version of ASP. The ASP.NET Framework is provided with a rich toolbox and
designer in the Visual Studio .NET IDE. Features of this tool include
what-you-see-is-what-you-get (WYSIWYG) editing, drag-and-drop server controls, and
automatic deployment. ASP.NET applications are based on the common language
runtime. As a result, the power and flexibility of the .NET platform is available
to ASP.NET application developers. The security features of ASP.NET provide a
number of options for implementing security and restricting user access. ASP.NET
has also been designed with scalability features that help to improve performance
in a multiprocessor environment.

Applications:

ASP.NET applications ensure that the .NET Framework class library, messaging,
and the data access solutions are seamlessly accessible on the Web. ASP.NET is
also language independent. ASP.NET makes it possible to build user interfaces that
separate application logic from presentation content. In addition, the common
language runtime simplifies application development by using managed code
services, such as automatic reference counting and garbage collection. The code
written in ASP.NET is compiled and not interpreted. ASP.NET also makes it easy to
perform common tasks, ranging from form submission and client authentication to
web site configuration.

Extensible Markup Language:

SQL Server includes native support for managing XML data, in addition to
relational data. For this purpose, it defines an xml data type that can be used
either as a data type in database columns or as literals in queries. XML columns
can be associated with schemas, and the data being stored is verified against the
schema. XML is converted to an internal binary data type before being stored in the
database. Specialized indexing methods are available for XML data. XML data is
queried using XQuery. Common language runtime integration is the main feature of
this edition, allowing SQL code to be written as managed code. In addition, it also
defines a new extension to XQuery that allows query-based modifications to XML
data. When the data is accessed over web services, results are returned as XML.

Data Storage:

The main unit of data storage is a database, which is a collection of tables with
typed columns. SQL Server supports different data types, including primary types
such as integer, float, decimal, char (including character strings), varchar
(variable-length character strings), binary (for unstructured blobs of data), and
text (for textual data), among others. The rounding of floats to integers uses
either symmetric arithmetic rounding or symmetric round down (fix), depending on
arguments. Microsoft SQL Server also allows user-defined composite types to be
defined and used. It also makes server statistics available as virtual tables and
views called dynamic management views. In addition to tables, a database can also
contain other objects, including views, stored procedures, indexes, and constraints,
along with a transaction log. The data in the database are stored in primary data
files with the extension .mdf. The amount of memory available to SQL Server decides
how many pages will be cached in memory.
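The two rounding behaviours mentioned above can be illustrated with a short sketch. In T-SQL the choice is made through the optional third argument of the ROUND function; the Python code below only mimics the two behaviours and is not SQL Server code. The function name round_symmetric and its truncate flag are assumptions made for this illustration.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

# Illustrative sketch; the function name and flag are hypothetical.


def round_symmetric(value, truncate=False):
    """Round a number to an integer symmetrically around zero.

    truncate=False: symmetric arithmetic rounding (ties go away from zero).
    truncate=True:  symmetric round down, or "fix" (the fraction is dropped).
    """
    mode = ROUND_DOWN if truncate else ROUND_HALF_UP
    # Going through str() keeps the decimal digits the caller wrote and
    # avoids binary floating-point representation error.
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=mode))


if __name__ == "__main__":
    for number in (2.5, -2.5, 2.7, -2.7):
        print(number, round_symmetric(number), round_symmetric(number, truncate=True))
```

"Symmetric" means that positive and negative values are treated as mirror images: 2.5 and -2.5 round to 3 and -3, and 2.7 and -2.7 truncate to 2 and -2.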

Features of SQL:

Log files are identified with the .ldf extension. Storage space allocated to a
database is divided into sequentially numbered pages, each 8 KB in size. A page is
the basic unit of input-output for SQL Server operations. A page is marked with a
96-byte header which stores metadata about the page, including the page number,
page type, free space on the page, and the ID of the object that owns it. The page
type defines the data contained in the page: data stored in the database, an index,
an allocation map (which holds information about how pages are allocated to tables
and indexes), a change map (which holds information about the changes made to other
pages since the last backup or logging), or large data types such as image or text.
While the page is the basic unit of an input-output operation, space is actually
managed in terms of an extent, which consists of 8 pages. A database object can
either span all 8 pages in an extent (a uniform extent) or share an extent with up
to 7 more objects (a mixed extent). A row in a database table cannot span more than
one page, so it is limited to 8 KB in size. Secondary data files, identified with an
.ndf extension, are used to store optional metadata.
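The page and extent figures above lend themselves to a small worked calculation. The following Python sketch uses only the numbers stated in the text (8 KB pages, a 96-byte header, 8 pages per extent) and assumes, as a simplification, that every page is filled completely apart from its header; real storage has additional per-row overhead that is ignored here. The function name storage_layout is hypothetical.

```python
# Illustrative sketch; figures are taken from the text, the function name is hypothetical.
PAGE_SIZE = 8 * 1024      # bytes per page (8 KB)
PAGE_HEADER = 96          # bytes of header per page
PAGES_PER_EXTENT = 8      # pages per extent


def storage_layout(data_bytes):
    """Return (pages, extents) needed to hold data_bytes of row data."""
    usable = PAGE_SIZE - PAGE_HEADER              # 8096 bytes left for data
    pages = -(-data_bytes // usable)              # ceiling division
    extents = -(-pages // PAGES_PER_EXTENT)       # ceiling division
    return pages, extents


if __name__ == "__main__":
    pages, extents = storage_layout(1_000_000)
    print(f"1,000,000 bytes -> {pages} pages in {extents} extents")
```

Under these assumptions an extent covers 8 x 8 KB = 64 KB of disk space, and one million bytes of row data occupy 124 pages, which are managed as 16 extents.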

Database Table:

A table is split into multiple partitions in order to spread a database over a
cluster. Rows in each partition are stored in either a B-tree or a heap structure.
If the table has an associated clustered index to allow fast retrieval of rows, the
rows are stored in order according to their index values, with a B-tree providing
the index. The data is in the leaf nodes, and the other nodes store the index values
for the leaf data reachable from the respective nodes. If the index is
non-clustered, the rows are not sorted according to the index keys. An indexed view
has the same storage structure as an indexed table. A table without an index is
stored in an unordered heap structure. Both heaps and B-trees can span multiple
allocation units of data storage.
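The difference between an unordered heap and rows kept in index order can be sketched as follows. This Python example is an analogy only: a binary search over a sorted list stands in for descending a B-tree, and the rows, keys, and function names are hypothetical.

```python
import bisect

# Illustrative sketch; a sorted list with binary search stands in for a B-tree.


def heap_lookup(rows, key):
    """Unordered heap: scan every row until the key is found."""
    for row in rows:
        if row[0] == key:
            return row
    return None


def build_clustered(rows):
    """Store the rows in key order and keep the keys for searching."""
    sorted_rows = sorted(rows)
    keys = [row[0] for row in sorted_rows]
    return sorted_rows, keys


def clustered_lookup(sorted_rows, keys, key):
    """Rows kept in key order: binary search replaces the full scan."""
    position = bisect.bisect_left(keys, key)
    if position < len(keys) and keys[position] == key:
        return sorted_rows[position]
    return None


if __name__ == "__main__":
    rows = [("0000000003", "C"), ("0000000001", "A"), ("0000000002", "B")]
    sorted_rows, keys = build_clustered(rows)
    print(heap_lookup(rows, "0000000002"))
    print(clustered_lookup(sorted_rows, keys, "0000000002"))
```

Both lookups return the same row; the ordered variant needs a number of comparisons that grows with the logarithm of the table size, while the heap scan grows linearly.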

Data Retrieval:

The main mode of retrieving data from a database is querying for it. The query is
expressed using a variant of SQL called T-SQL. The query declaratively specifies
what is to be retrieved. It is processed by the query processor, which figures out
the sequence of steps that will be necessary to retrieve the requested data. The
sequence of actions that is necessary to execute a query is called a query plan.
Under row versioning, both the old and the new versions of a row are stored and
maintained, though the old versions are moved out of the database into a system
database identified as tempdb. When a row is in the process of being updated, other
requests are not blocked (unlike locking) but are executed on the older version of
the row. If the other request is an update statement, it will result in two
different versions of the row; both of them will be stored by the database,
identified by their respective transaction IDs.
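The idea of a declarative query and its plan can be tried out with a small sketch. The example below uses SQLite through Python's sqlite3 module as a stand-in, because it needs no server; it is not SQL Server and the SQL shown is not T-SQL specific. The books table, its rows, and the function names are made up for this illustration.

```python
import sqlite3

# Illustrative sketch; SQLite stands in for SQL Server, and all data is made up.
QUERY = "SELECT title, year FROM books WHERE author = ? ORDER BY year"


def build_demo_db():
    """Create an in-memory database with a small books table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE books ("
        "isbn TEXT PRIMARY KEY, title TEXT, author TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO books VALUES (?, ?, ?, ?)",
        [
            ("0000000001", "First Example", "A. Author", 2004),
            ("0000000002", "Second Example", "A. Author", 2001),
            ("0000000003", "Third Example", "B. Writer", 2008),
        ],
    )
    return conn


def find_books_by_author(conn, author):
    """The query states what is wanted; the engine decides how to fetch it."""
    return conn.execute(QUERY, (author,)).fetchall()


def query_plan(conn, author):
    """Ask SQLite for the plan it chose; the last column is a description."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (author,)).fetchall()
    return [row[-1] for row in rows]


if __name__ == "__main__":
    connection = build_demo_db()
    print(find_books_by_author(connection, "A. Author"))
    print(query_plan(connection, "A. Author"))
```

The query text never says whether to scan the table or use an index; that decision appears only in the plan, whose exact wording depends on the SQLite version.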

Query Processing:

There might be multiple ways to process the same query. For example, for a query

that contains a join statement and a select statement, executing join on both the

tables and then executing select on the results would give the same result as

selecting from each table and then executing the join, but result in different

execution plans. In such a case, the database server chooses the plan that is expected to

yield the results in the shortest possible time. Given a query, the query optimizer
looks at the database schema, the database statistics, and the system load at that
time. It then decides the sequence in which to access the tables referred to in the
query, the sequence in which to execute the operations, and the access method to be
used to access the tables. For example, if the table has an associated index, it
decides whether the index should be used or not: if the index is on a column whose
values are not unique for most of the rows (low selectivity), it might not be
worthwhile to use the index to access the data. Stored procedures accept values
sent by the client as input parameters and send back the results as output
parameters. They can call defined functions and other stored procedures, including
the same stored procedure, up to a set number of times.
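The join and select example from the query processing discussion above can be made concrete. The Python sketch below evaluates the same request with two hand-written plans and shows that they return identical rows while building a different number of joined rows. It is not how a real optimizer works internally; the table contents, the shop_id and city fields, and the function names are assumptions made for this illustration.

```python
# Illustrative sketch; tables, fields and function names are hypothetical.


def join(left, right, key):
    """Nested-loop equi-join of two lists of dicts on `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def plan_join_then_select(shops, books, city):
    """Plan 1: join everything first, then keep the rows for one city."""
    joined = join(shops, books, "shop_id")
    result = [row for row in joined if row["city"] == city]
    return result, len(joined)       # second value: rows produced by the join


def plan_select_then_join(shops, books, city):
    """Plan 2: keep the shops for one city first, then join."""
    selected = [shop for shop in shops if shop["city"] == city]
    result = join(selected, books, "shop_id")
    return result, len(result)       # second value: rows produced by the join


if __name__ == "__main__":
    shops = [{"shop_id": 1, "city": "Northtown"}, {"shop_id": 2, "city": "Southtown"}]
    books = [
        {"shop_id": 1, "title": "A"},
        {"shop_id": 2, "title": "B"},
        {"shop_id": 2, "title": "C"},
    ]
    print(plan_join_then_select(shops, books, "Southtown"))
    print(plan_select_then_join(shops, books, "Southtown"))
```

Choosing between such equivalent plans on the basis of estimated cost is the task the query optimizer performs automatically.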

5. SYSTEM DESIGN

Systems design is the process or art of defining the architecture, components,

modules, interfaces, and data for a system to satisfy specified requirements. One

could see it as the application of systems theory to product development. There is

some overlap with the disciplines of systems analysis, systems architecture and

systems engineering. The broader topic of product development blends the

perspective of marketing, design, and manufacturing into a single approach. Design

is the act of taking the marketing information and creating the design of the product

to be manufactured. Systems design is therefore the process of defining and

developing systems to satisfy specified requirements of the user. Until the 1990s

systems design had a crucial and respected role in the data processing industry. In the

1990s standardization of hardware and software resulted in the ability to build

modular systems. The increasing importance of software running on generic

platforms has enhanced the discipline of software engineering. Object-oriented

analysis and design methods are becoming the most widely used methods for

computer system design.

Unified Modeling Language:

The Unified Modeling Language (UML) has become the standard language used in

object-oriented analysis and design. It is widely used for modeling software systems

and is increasingly used for designing non-software systems and organizations.

Graphical system design is a modern approach to designing, prototyping, and

deploying embedded systems that combines open graphical programming with

commercial off-the-shelf (COTS) hardware. Graphical system design simplifies

development, resulting in higher-quality designs with a migration to custom design.

Graphical system design is basically a way for domain

experts, or non-embedded experts, to access embedded design where they would

traditionally need to outsource the work to an embedded design expert. This approach to

embedded system design is a super-set of Electronic System Level (ESL) design.

Graphical system design expands on the EDA-based ESL definition to include other

types of embedded system design including industrial machines and medical devices.

Many of these expanded applications can be defined as the long tail applications.

Graphical system design is a complementary but encompassing approach that

includes embedded and electronic system design, implementation, and verification

tools. ESL and graphical system design are really part of the same movement: higher

abstraction and more design automation, aimed at solving the real engineering

challenges that designers are facing today and at addressing design flaws that are

introduced at the specification stage, so that they are detected well before

validation and the product can be delivered on time.

Integrated Development Environment:

Microsoft Visual Studio is an Integrated Development Environment (IDE) from Microsoft.

It can be used to develop console and graphical user interface applications, along with

Windows Forms applications, web sites, web applications, and web services, in both

native code and managed code, for all platforms supported by Microsoft Windows,

Windows Mobile, Windows CE, .NET Framework, .NET Compact Framework, and

Microsoft Silverlight. Visual Studio includes a code editor supporting IntelliSense

as well as code refactoring. The integrated debugger works both as a source-level

debugger and as a machine-level debugger. Other built-in tools include a form designer

for building GUI applications, a web designer, a class designer, and a database schema

designer. It also accepts plug-ins that enhance the functionality at almost every

level, including adding support for source control systems like

Subversion and Visual SourceSafe, and adding new tool sets such as editors and visual

designers for domain-specific languages, or toolsets for other aspects of the software

development lifecycle. Visual Studio supports languages by means of language

services, which allow the code editor and debugger to support nearly any

programming language, provided a language-specific service exists. Languages such as

M, Python, and Ruby are supported through language services installed separately.

It also supports JavaScript and CSS. Language-specific versions of Visual Studio also

exist, which provide more limited language services to the user. These individual

packages are called Microsoft Visual Basic, Visual J#, Visual C#, and Visual C++.

Microsoft provides Express editions of its Visual Studio 2010 components Visual Basic,

Visual C#, Visual C++, and Visual Web Developer at no cost. Visual Studio 2010, 2008,

and 2005 Professional Editions, along with language-specific versions of Visual Studio

2005, are available for free to students as downloads through Microsoft's DreamSpark

program.

Code Editor:

Visual Studio, like any other IDE, includes a code editor that supports syntax

highlighting and code completion using IntelliSense. It is used not only for variables,

functions and methods but also for language constructs like loops and queries.

IntelliSense is supported for the included languages, as well as for XML, Cascading

Style Sheets, and JavaScript when developing web sites and web applications.

Autocomplete suggestions pop up in a modeless list box, overlaid on top of the code

editor. From Visual Studio 2008 onwards, the list can be made temporarily semi-transparent to see

the code obstructed by it. The code editor is used for all supported languages. The

Visual Studio code editor also supports setting bookmarks in code for quick

navigation. Other navigational aids include collapsing code blocks and incremental

search, in addition to normal text search and regex search. The code editor also

includes a multi-item clipboard and a task list. The code editor supports code

snippets, which are saved templates for repetitive code that can be inserted into code

and customized for the project being worked on. Visual Studio provides background

compilation, also called incremental compilation. As code is being written, Visual

Studio compiles it in the background in order to provide feedback about syntax and

compilation errors, which are flagged with a red wavy underline. Warnings are

marked with a green underline. Background compilation does not generate

executable code, since it requires a different compiler than the one used to generate

executable code. Background compilation was initially introduced with Microsoft

Visual Basic but has now been expanded to all included languages.
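The idea behind background compilation, checking code for errors without producing an executable, can be shown with a loose analogy. The Python sketch below is not the Visual Studio background compiler; it only parses source text and reports a syntax error if one is found, without running the code. The function name check_syntax is hypothetical.

```python
import ast

# Illustrative analogy only; this is not how Visual Studio is implemented.


def check_syntax(source):
    """Parse the source without running it.

    Returns None when the source is syntactically valid, otherwise a
    (line number, message) pair that an editor could use for an underline.
    """
    try:
        ast.parse(source)
    except SyntaxError as error:
        return error.lineno, error.msg
    return None


if __name__ == "__main__":
    print(check_syntax("total = 1 + 2\n"))      # valid: prints None
    print(check_syntax("total = (1 + 2\n"))     # invalid: prints line and message
```

An editor could call such a check after every keystroke pause and underline the reported line, which is the kind of feedback background compilation provides.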

Debugger:

Visual Studio includes a debugger that works both as a source-level debugger and as a

machine-level debugger. It works with both managed code and native code and

can be used for debugging applications written in any language supported by Visual

Studio. In addition, it can also attach to running processes and monitor and debug

those processes. If source code for the running process is available, it displays the

code as it is being run. If source code is not available, it can show the disassembly.

The Visual Studio debugger can also create memory dumps and load them later for

debugging. Multi-threaded programs are also supported. The debugger can be

configured to be launched when an application running outside the Visual Studio

environment crashes. The debugger allows

setting breakpoints (which allow execution to be stopped temporarily at a certain

position) and watches (which monitor the values of variables as the execution

progresses). Breakpoints can be conditional, meaning they get triggered when the

condition is met. Code can be stepped over, i.e., run one line of source code at a time.

During coding, the Visual Studio debugger lets certain functions be invoked

manually from the Immediate tool window. The parameters to the method are

supplied at the Immediate window.

The data flow diagram (DFD) is also known as a bubble chart. It is a simple graphical

formalism that can be used to represent a system in terms of the input data to the

system, the various processing carried out on these data, and the output data

generated by the system.

Search Engine:

Most Web databases are only accessible via a query interface through which users

can submit queries. Once a query is received, the Web server will retrieve the

corresponding results from the back-end database and return them to the user. To

build a system that helps users to integrate and, more importantly, compare the query

results returned from multiple Web databases, a crucial task is to match the records from

different sources that refer to the same real-world entity, as shown in Figure.3.

Figure.3. Search Engine

[Diagram labels: Query, Result set, Searching, Databases (three databases shown)]
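The task of matching records from different sources that refer to the same real-world entity can be illustrated with a deliberately simple sketch. The Python code below is not the matching method of this project; it is a naive baseline that compares normalized ISBNs when both are present and otherwise falls back to the year and a string similarity of the titles. The field names, the 0.85 threshold, and the sample records are assumptions made for this illustration.

```python
from difflib import SequenceMatcher

# Illustrative baseline only; not the matching algorithm of this project.


def normalize_isbn(isbn):
    """Keep only digits and the check character X, so that hyphens do not matter."""
    return "".join(ch for ch in isbn if ch.isdigit() or ch in "xX").upper()


def title_similarity(first, second):
    """Similarity between 0 and 1 of two titles, ignoring case and outer spaces."""
    return SequenceMatcher(None, first.lower().strip(), second.lower().strip()).ratio()


def same_entity(rec_a, rec_b, threshold=0.85):
    """Decide whether two records appear to describe the same book."""
    isbn_a = normalize_isbn(rec_a.get("isbn", ""))
    isbn_b = normalize_isbn(rec_b.get("isbn", ""))
    if isbn_a and isbn_b:
        return isbn_a == isbn_b
    # Without two ISBNs, fall back to year plus approximate title match.
    return (rec_a["year"] == rec_b["year"]
            and title_similarity(rec_a["title"], rec_b["title"]) >= threshold)


def match_records(source_a, source_b, threshold=0.85):
    """Return every pair (a, b) from the two result sets judged to match."""
    return [(a, b) for a in source_a for b in source_b if same_entity(a, b, threshold)]


if __name__ == "__main__":
    shop_one = [{"isbn": "000-0-00-000000-1", "title": "Example Book", "year": 2010}]
    shop_two = [{"isbn": "0000000000001", "title": "Example book (2nd print)", "year": 2010}]
    print(match_records(shop_one, shop_two))
```

Comparing every pair of records costs time proportional to the product of the two result set sizes, which is acceptable for small query results but would need blocking or indexing for large ones.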

Level 0:

The Authentication module is used for security purposes. Only an authenticated
person can perform all the operations in this project, and that authentication is
given to the administrator only. The administrator is the only authorised user for
this smart record matching software. An ordinary user is only allowed to search, as
shown in Figure.4.

Figure.4. Level 0
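The rule described for Level 0, where the administrator may perform every operation and an ordinary user may only search, can be sketched as follows. This is a minimal Python illustration and not the project's ASP.NET code; the role names, the operation names, the account layout, and the iteration count are assumptions made for this example.

```python
import hashlib
import hmac
import os

# Illustrative sketch; roles, operations and account layout are hypothetical.
PERMISSIONS = {
    "administrator": {"search", "upload"},
    "user": {"search"},
}


def hash_password(password, salt):
    """Derive a password hash so that plain passwords are never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def make_account(role, password):
    """Create an account record with a random salt."""
    salt = os.urandom(16)
    return {"role": role, "salt": salt, "hash": hash_password(password, salt)}


def authenticate(accounts, username, password):
    """Return the role of the user, or None when the login fails."""
    account = accounts.get(username)
    if account is None:
        return None
    candidate = hash_password(password, account["salt"])
    if hmac.compare_digest(candidate, account["hash"]):
        return account["role"]
    return None


def is_allowed(role, operation):
    """Check whether a role may perform an operation."""
    return operation in PERMISSIONS.get(role, set())


if __name__ == "__main__":
    accounts = {"admin": make_account("administrator", "example-password")}
    role = authenticate(accounts, "admin", "example-password")
    print(role, is_allowed(role, "upload"), is_allowed("user", "upload"))
```

Keeping the permission table separate from the login check means a new role or operation can be added without touching the authentication code.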

Level 1:

The fields of the records over which the search must proceed, in order to find the
matching records, are uploaded in this module. Only the Administrator is allowed to
upload the book data. This module contains the Book Shop Name, Author Name, ISBN
Number, Book Title, Year, and the Specified Book for uploading, as shown in
Figure.5.

Figure.5. Level 1
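The uploaded fields listed above can be modelled as a simple record type with basic checks. The Python sketch below is an illustration only; the class name BookUpload, the validation rules, and the omission of the uploaded book file itself are assumptions, not the project's actual data model.

```python
from dataclasses import dataclass

# Illustrative sketch; the class and its validation rules are hypothetical.
# The uploaded book file ("Specified Book") is left out of this sketch.


@dataclass(frozen=True)
class BookUpload:
    book_shop_name: str
    author_name: str
    isbn_number: str
    book_title: str
    year: int


def validate_upload(upload):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    for field_name in ("book_shop_name", "author_name", "isbn_number", "book_title"):
        if not getattr(upload, field_name).strip():
            problems.append(f"{field_name} is empty")
    # ISBNs have 10 or 13 characters once hyphens and spaces are removed.
    compact = upload.isbn_number.replace("-", "").replace(" ", "")
    if compact and len(compact) not in (10, 13):
        problems.append("isbn_number must have 10 or 13 characters")
    if upload.year <= 0:
        problems.append("year must be positive")
    return problems


if __name__ == "__main__":
    record = BookUpload("Example Shop", "A. Author", "0000000001", "Example Book", 2010)
    print(validate_upload(record))
```

Validating the record before it is stored keeps malformed ISBNs and empty titles out of the data that the matching step later has to compare.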

Use Case Diagram:

Many book shops exist, and each book shop has a separate database for future
reference or any other reference. A book shop that wants to upload its books to the
website must create a database for its books on the web. Before uploading a book,
the book shop has to fill in the registration form on the website. The form contains
the Book Shop Name, Address of the book shop, Contact Number, Mail ID, and User
Name, and each book shop has a separate password for later login. The users are
allowed to search for books, as shown in Figure.6.

Figure.6. Use Case Diagram

[Diagram: User1, User2, User3, etc. use the Search Engine; Book Shop 1, Book Shop 2,
Book Shop 3, etc. perform Books Uploading to the Databases.]

6. SYSTEM TESTING

Software testing is an investigation conducted to provide stakeholders with

information about the quality of the product or service under test. Software testing

also provides an objective independent view of the software to allow the business to

appreciate and understand the risks of software implementation. Testing

techniques include the process of executing a program or application with the intent

of finding software bugs. Software testing can also be stated as the process of

validating and verifying that a software program meets the business and technical

requirements that guided its design and development, works as expected, and can be

implemented with the same characteristics.

Software Testing:

Software testing can be implemented at any time in the development process

depending on the testing method employed. However, most of the test effort occurs

after the requirements have been defined and the coding process has been completed.

As such, the methodology of the test is governed by the software development

methodology adopted. Different software development models will focus the test

effort at different points in the development process. Newer development models,

such as Agile, often employ test driven development and place an increased portion

of the testing in the hands of the developer, before it reaches a formal team of testers.

In a more traditional model, most of the test execution occurs after the requirements

have been defined and the coding process has been completed. Testing can never

completely identify all the defects within software. Instead, it furnishes a criticism or

comparison that compares the state and behavior of the product against oracles, that is,

principles or mechanisms by which someone might recognize a problem. These

oracles may include (but are not limited to) specifications, contracts, comparable

products, past versions of the same product, inferences about intended or expected

purpose, user or customer expectations, relevant standards, applicable laws, or other

criteria. Every software product has a target audience. For example, the audience for

video game software is completely different from that for banking software. Therefore, when

an organization develops or otherwise invests in a software product, it can assess

whether the software product will be acceptable to its end users, its target audience,

its purchasers, and other stakeholders. Software testing is the process of attempting

to make this assessment. A primary purpose for testing is to detect software failures

so that defects may be uncovered and corrected. This is a non-trivial pursuit. Testing

cannot establish that a product functions properly under all conditions but can only

establish that it does not function properly under specific conditions.

Scope:

The scope of software testing often includes examination of code as well as

execution of that code in various environments and conditions as well as examining

the aspects of code: does it do what it is supposed to do and do what it needs to do. In

the current culture of software development, a testing organization may be separate

from the development team. There are various roles for testing team members.

Information derived from software testing may be used to correct the process by

which software is developed. Functional testing refers to tests that verify a specific

action or function of the code. These are usually found in the code requirements

documentation, although some development methodologies work from use cases

or user stories. Functional tests tend to answer the question "can the user do this?" or

"does this particular feature work?" Non-functional testing refers to aspects of the

software that may not be related to a specific function or user action, such as

scalability or security. Non-functional testing tends to answer such questions as "how

many people can log in at once?" or "how easy is it to hack this software?". Not all

software defects are caused by coding errors. One common source of expensive

defects is requirement gaps, that is, unrecognized requirements that result in

errors of omission by the program designer. A common source of requirements gaps

is non-functional requirements such as testability, scalability, maintainability,

usability, performance, and security.

Tolerance:

Software faults occur through the following process. A programmer makes an error

(mistake), which results in a defect (fault, bug) in the software source code. If this defect

is executed, in certain situations the system will produce wrong results, causing a

failure. Not all defects will necessarily result in failures. For example, defects in dead

code will never result in failures. A defect can turn into a failure when the

environment is changed. Examples of these changes in environment include the

software being run on a new hardware platform, alterations in source data or

interacting with different software. A single defect may result in a wide range of

failure symptoms. A common cause of software failure (real or perceived) is a lack of

compatibility with other application software, operating systems (or operating system

versions, old or new), or target environments that differ greatly from the original (such

as a terminal or GUI application intended to be run on the desktop now being

required to become a web application, which must render in a web browser). For

example, in the case of a lack of backward compatibility, this can occur because the

programmers develop and test software only on the latest version of the target

environment, which not all users may be running. This results in the unintended

consequence that the latest work may not function on earlier versions of the target

environment or on older hardware that earlier versions of the target environment

were capable of using. A very fundamental problem with software testing is that

testing under all combinations of inputs and preconditions (initial state) is not feasible,

even with a simple product. This means that the number of defects in a software

product can be very large and defects that occur infrequently are difficult to find in

testing. More significantly, non-functional dimensions of quality (how it is supposed

to be versus what it is supposed to do), such as usability, scalability, performance,

compatibility and reliability, can be highly subjective; something that constitutes

sufficient value to one person may be intolerable to another.

Validation:

There are many approaches to software testing. Reviews, walkthroughs, or

inspections are considered as static testing, whereas actually executing programmed

code with a given set of test cases is referred to as dynamic testing. Static testing can

be omitted, and unfortunately in practice often is. Dynamic testing takes place when

the program itself is used for the first time, which is generally considered the

beginning of the testing stage. Dynamic testing may begin before the program is

100% complete in order to test particular sections of code (modules or discrete

functions). Typical techniques for this are either using stubs/drivers or execution from

a debugger environment. For example, spreadsheet programs are, by their very

nature, tested to a large extent interactively on the fly, with results displayed

immediately after each calculation or text manipulation. Though controversial,

software testing may be viewed as an important part of the software quality

assurance (SQA) process. In SQA, software process specialists and auditors take a

broader view on software and its development. They examine and change the

software engineering process itself to reduce the number of faults that end up in the

delivered software: the so-called defect rate.
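
The stub and driver technique mentioned in this section can be illustrated with a minimal Python sketch. The names search_books, fetch_records_stub and run_driver, as well as the sample records, are hypothetical and are not taken from the project; the sketch only shows how one module might be exercised dynamically before the components it depends on are complete.

```python
# Hypothetical module under test: filters book records by a title keyword.
# The data source is passed in as a function so that a stub can replace
# the real database layer.
def search_books(keyword, fetch_records):
    keyword = keyword.lower()
    return [r for r in fetch_records() if keyword in r["title"].lower()]


# Stub: stands in for the unfinished database layer and returns fixed records.
def fetch_records_stub():
    return [
        {"isbn": "111", "title": "Data Mining Concepts"},
        {"isbn": "222", "title": "Operating Systems"},
    ]


# Driver: calls the module under test and compares the result with the
# expected output.
def run_driver():
    result = search_books("mining", fetch_records_stub)
    return [r["isbn"] for r in result] == ["111"]


if __name__ == "__main__":
    print("test passed" if run_driver() else "test failed")
```

Because the data source is injected, the same driver could later be pointed at the real database layer without changing the module under test.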

Software Faults:

Software faults occur through the following process. A programmer makes an error

(mistake), which results in a defect (fault, bug) in the software source code. If this defect

is executed, in certain situations the system will produce wrong results, causing a

failure. Not all defects will necessarily result in failures. For example, defects in dead

code will never result in failures. A defect can turn into a failure when the

environment is changed. A very fundamental problem with software testing is that

testing under all combinations of inputs and preconditions (initial state) is not feasible,

even with a simple product. This means that the number of defects in a software

product can be very large and defects that occur infrequently are difficult to find in

testing. Test techniques include, but are not limited to, the process of executing a

program or application with the intent of finding software bugs. Software testing can

also be stated as the process of validating and verifying that a software program

meets its requirements and works as expected. In this project, the accuracy settings

were adjusted to make each query more accurate.

Though controversial, software testing may be viewed as an important part of the

software quality assurance (SQA) process. In SQA, software process specialists and

auditors take a broader view on software and its development. They examine and

change the software engineering process itself to reduce the number of faults that end

up in the delivered software: the so-called defect rate. What constitutes an acceptable

defect rate depends on the nature of the software.

7. OBSERVATION AND ANALYSIS

The record matching system is proposed to overcome the disadvantages of the

existing system. The existing system is based on predefined matching rules that are

either hand-coded by domain experts or learned offline by some learning

method from a set of training examples. Hand coding and offline learning approaches

are not appropriate for two reasons. The first reason is that the full data set is not

available beforehand and therefore good representative data for training are hard to

obtain. The second reason is that even if good representative data are found and labelled

for learning, the rules learned on the representatives of a full data set may not work

well on a partial and biased part of that data set.

Record Matching:

To overcome the disadvantages of the existing system, a new record matching

method is proposed, called the unsupervised duplicate detection (UDD)

method. UDD is used for identifying duplicates among records in query results from

multiple Web databases. This method focuses on techniques for adjusting the weights

of the record fields in calculating the similarity between two records. Two records

are considered duplicates if they are similar enough on their fields. Due to the

absence of labeled training examples, this method uses record pairs from the same

data source, together with a sample of universal data consisting of record pairs from

different data sources, as an approximation for a negative training set. The

experimental results verify that this method is reasonable since the proportion of

duplicate records in the universal set is usually much smaller than the proportion of

non-duplicates. The two main advantages of this system are that it solves the online

duplicate detection problem and that it provides specific record matching.
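
The idea of a weighted field similarity can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names, the fixed weights, the threshold and the use of difflib.SequenceMatcher as the string similarity measure are all hypothetical choices, whereas UDD adjusts the field weights automatically.

```python
from difflib import SequenceMatcher

# Hypothetical field weights (they sum to 1) and duplicate threshold.
# UDD adjusts the weights itself; here they are fixed for illustration.
WEIGHTS = {"title": 0.5, "author": 0.3, "year": 0.2}
THRESHOLD = 0.85


def field_similarity(a, b):
    """Similarity of two field values in [0, 1] (assumed string measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights=WEIGHTS):
    """Weighted sum of the per-field similarities of two records."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def is_duplicate(r1, r2, threshold=THRESHOLD):
    """Two records count as duplicates if they are similar enough."""
    return record_similarity(r1, r2) >= threshold


# Hypothetical records returned by two different Web databases.
a = {"title": "Data Mining", "author": "J. Han", "year": "2008"}
b = {"title": "Data mining", "author": "J. Han", "year": "2008"}
c = {"title": "Operating Systems", "author": "A. Silberschatz", "year": "2005"}
print(is_duplicate(a, b), is_duplicate(a, c))
```

A field with a larger weight contributes more to the overall similarity, which is why adjusting the weights changes which record pairs are reported as duplicates.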

Programming Environment:

Microsoft Visual Studio is an integrated development environment (IDE) from

Microsoft. It can be used to develop console and graphical user interface applications

along with Windows Forms applications, web sites, web applications, and web

services in both native code and managed code for all platforms supported

by Microsoft Windows, Windows Mobile, Windows CE, .NET Framework, .NET

Compact Framework and Microsoft Silverlight. The integrated development

environment is shown in Figure.7.

Figure.7. Developing Environment

Form Screen Shots:

The screen shots provide an overview of the various forms which are required for the

implementation of the project. This system covers the registration of books as well as their

searching and uploading. The project has several forms and several tables in

the database which includes user registration, user login, new book registration,

searching based on number and name, user registration report and book report.

Home Page:

The home page provides options for registration, the search engine and the contact

details. It also explains record matching so that users can understand both the

concept and the function of this system. On clicking the registration button,

a registration page is shown, where the user can register by giving the

company name or book name. The search engine button can be clicked in order to search

books based on book name, author name and number. The homepage is shown in

Figure.8.

Figure.8.Home page

User Registration:

The user registration page allows new users to register with the system.

Here, the user can register by giving the company name, address, contact number,

e-mail ID, user name and password. The page below shows the registration of a

bookshop named winner. By clicking the submit button, the user can confirm the

registration. After registration, the user can log in to the system using the user name

and password. The user registration page is shown in Figure.9.

Figure.9.User Registration Page

User Login:

The user login page allows new and existing users to log in to the system by giving

the user name and password. Only registered users can log in to the system. If the

user name and password entered are correct, the user logs in successfully;

otherwise the login fails. A registered user can add new books and search for the

required books whenever necessary. The user login page is shown in Figure.10.

Figure.10.User Login Page

New Book Registration:

This page allows new books to be added by giving the book shop name,

ISBN number, author name, book title, publication, URL (Uniform Resource

Locator), year of publication and cost. It also allows books to be

uploaded: the user can upload the books from the provided URL. The book registration

page is shown in Figure.11.

Figure.11.Book Registration Page

ISBN Number Wise Searching:

The search page is reached by clicking the search engine button on the home

page. Searching can be done based on the ISBN number, author name, book name

and year. In ISBN number wise searching, searching is done on the basis of the ISBN

number. The user can search by giving the whole ISBN number or a part

of it. The search results are returned after duplicates are removed. Here, the user has

given the last three digits of the ISBN number, as shown in Figure.12.
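
A search on part of an ISBN number followed by duplicate removal can be sketched in Python as follows. This is a simplified illustration, not the project's implementation: the function names and sample records are hypothetical, and duplicates are recognised here by an exact match on the normalized ISBN number rather than by the UDD similarity measure.

```python
def normalize_isbn(isbn):
    """Strip hyphens and spaces so the same ISBN compares equal across sources."""
    return isbn.replace("-", "").replace(" ", "")


def search_by_isbn(query, *sources):
    """Return records whose ISBN contains the query, keeping one record per ISBN."""
    query = normalize_isbn(query)
    seen = set()
    results = []
    for source in sources:
        for record in source:
            key = normalize_isbn(record["isbn"])
            if query in key and key not in seen:
                seen.add(key)
                results.append(record)
    return results


# Hypothetical contents of two book shop databases (made-up ISBN numbers).
shop_a = [{"isbn": "111-222-627", "title": "Sample Book A"}]
shop_b = [
    {"isbn": "111222627", "title": "Sample Book A (reprint listing)"},
    {"isbn": "333-444-555", "title": "Sample Book B"},
]

matches = search_by_isbn("627", shop_a, shop_b)
print([m["title"] for m in matches])
```

The same book listed by both shops is returned once, because both listings normalize to the same ISBN key.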

Figure.12.ISBN Number Wise Searching

Book Name Wise Searching:

In this mode, searching is done on the basis of book names. The user can

search by giving the whole book name or a part of it. Here the user has given a part

of the book name and retrieved all the results that match the given query. The

search results are returned after duplicates are removed. From the retrieved results, the user

can select the required book and upload it. Book name wise searching is

shown in Figure.13.

Figure.13.Book Name Wise Searching

Year Wise Searching:

In this mode, searching is done on the basis of the year of publication. The user can

search by giving the year of publication. Here the user has given the year

2008 and retrieved all the results that match the given query. The search results

are returned after duplicates are removed. From the retrieved results, the user can select the

required book and upload it; in this example, the user can upload a book published in the

year 2008. Year wise searching is shown in Figure.14.

Figure.14.Year Wise Searching

Author Wise Searching:

In this mode, searching is done on the basis of the author’s name. The user can

search by giving the name of the author as a whole or a part of it. Here the

user has given a part of the author name and retrieved all the results that match

the given query. The search results are returned after duplicates are removed. From the

retrieved results, the user can select the required book and upload it. Author

wise searching is shown in Figure.15.

Figure.15.Author Wise Searching

Administrator Login:

The administrator can log in to the system by giving the admin name and password. After

login, the administrator can view the registration report and the book reports from

both the databases. The list of registered users is shown by clicking

the registration report button, and the list of books from each

database is shown by clicking the book report buttons. The administrator login page is

shown in Figure.16.

Figure.16.Administrator Login

Database Design:

Microsoft SQL Server is a relational model database server produced by Microsoft.

Its primary query languages are T-SQL and ANSI SQL. SQL Server allows multiple

clients to use the same database concurrently. As such, it needs to control concurrent

access to shared data, to ensure data integrity - when multiple clients update the same

data, or clients attempt to read data that is in the process of being changed by another

client. SQL Server provides two modes of concurrency control: pessimistic

concurrency and optimistic concurrency. The protocol layer implements the external

interface to SQL Server. The database design window is shown in Figure.17.
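
The general idea behind optimistic concurrency control can be sketched as follows. The sketch uses Python's built-in sqlite3 module, a hypothetical book table and a hypothetical version column purely for illustration; it does not show how SQL Server implements this mode internally, only the principle that an update is accepted when the row has not changed since the client read it.

```python
import sqlite3

# In-memory database with a hypothetical table; the version column is an
# assumption used to detect concurrent changes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, cost REAL, version INTEGER)")
conn.execute("INSERT INTO book VALUES (1, 250.0, 1)")


def update_cost(conn, book_id, new_cost, version_read):
    """Apply the update only if the row still has the version the client read."""
    cur = conn.execute(
        "UPDATE book SET cost = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_cost, book_id, version_read),
    )
    conn.commit()
    return cur.rowcount == 1  # False means another client changed the row first


# Two clients both read version 1; the first update wins, the second is rejected.
first = update_cost(conn, 1, 300.0, version_read=1)
second = update_cost(conn, 1, 275.0, version_read=1)
print(first, second)
```

In the pessimistic mode, by contrast, the row would be locked while the first client works on it, so the second client would have to wait instead of being rejected afterwards.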

Figure.17. Database Design

User Registration Report:

User registration is done by all the users. The user registration report comprises

the id, the name of the book shop, the address of the book shop, the contact number as well

as the e-mail address. This helps the user to contact different book shops. The user can contact

the shops by mail or by phone, since both details are provided in the

user registration report. The user registration report is shown in Figure.18.

Figure.18. User Registration Report

CONCLUSION

Record matching, which identifies the records that represent the same real-world

entity, is an important step for data integration. Most duplicate detection methods

are based on offline learning techniques: they are supervised and require the user

to provide training data. Since query results are dynamically generated,

supervised record matching methods are not applicable for the Web database scenario. In the

Web database scenario, where the records to match are greatly query-dependent, a

pre-trained approach is not applicable, as the set of records in each query’s results is a

biased subset of the full data set. To overcome this problem, an unsupervised, online

approach, unsupervised duplicate detection (UDD), is introduced. UDD is used for

detecting duplicates over the query results of multiple Web databases and also

for online duplicate detection. A linear kernel, which is a fast kernel function, is

used in duplicate detection. Two classifiers are implemented to avoid the duplication

problem: the weighted component similarity summing classifier (WCSS) and the

support vector machine classifier (SVM). In this algorithm, WCSS plays an

important role. It is used to identify some duplicate vectors when there are no

positive examples. After iteration begins, it is used again to cooperate with SVM to

identify new duplicate vectors. Since no duplicate vectors are available initially, classifiers

that need class information to train, such as decision trees, cannot be used. The two types

of intuition in WCSS are the duplicate intuition and the non-duplicate intuition. In the duplicate

intuition, the similarity between two duplicate records should be equal to one, and in the

non-duplicate intuition, the similarity between two non-duplicate records should be equal to

zero.
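
The two intuitions can be turned into field weights in several ways. The following Python sketch shows one plausible scheme and is an assumption for illustration only, not necessarily the formula used in this project: fields that are dissimilar on known non-duplicate pairs, and fields that are similar on known duplicate pairs, receive larger weights. The function names and the mixing factor a are hypothetical.

```python
def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def wcss_weights(duplicates, non_duplicates, a=0.5):
    """Combine both intuitions into one weight per field (assumed scheme).

    Each vector holds per-field similarities in [0, 1]; `a` is a hypothetical
    mixing factor between the duplicate and non-duplicate intuitions.
    """
    n_fields = len(non_duplicates[0])
    # Non-duplicate intuition: reward fields that are dissimilar on non-duplicates,
    # so that the weighted similarity of non-duplicates is pushed towards zero.
    w_n = normalize([sum(1 - v[i] for v in non_duplicates) for i in range(n_fields)])
    if not duplicates:  # first iteration: no positive examples are available yet
        return w_n
    # Duplicate intuition: reward fields that are similar on duplicates,
    # so that the weighted similarity of duplicates is pushed towards one.
    w_d = normalize([sum(v[i] for v in duplicates) for i in range(n_fields)])
    return [a * d + (1 - a) * n for d, n in zip(w_d, w_n)]


def wcss_similarity(vector, weights):
    """Weighted component similarity sum of one similarity vector."""
    return sum(w * s for w, s in zip(weights, vector))


# Hypothetical similarity vectors with two fields per record pair.
non_dups = [[0.2, 0.9], [0.0, 0.9]]
weights = wcss_weights([], non_dups)
print(weights, wcss_similarity([1.0, 1.0], weights))
```

With no duplicate vectors available, only the non-duplicate intuition contributes, which mirrors the first step of the algorithm described above; once duplicates have been identified, they can be passed in to refine the weights.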
