BIOMEDICAL DATA INTEGRATION BASED ON METAQUERIER ARCHITECTURE

GROUP MEMBERS
- NAIEEM KHAN
- EUSUF ABDULLAH MIM
- M SAMIULLAH CHOWDHURY

ADVISOR: KHONDKER SHAJADUL HASAN
CO-ADVISOR: JAVED SIDDIQUE

Three basic parts of the project

DATA INTEGRATION

METAQUERIER ARCHITECTURE

BIOMEDICAL DATA

DATA INTEGRATION: What does it mean?

Data integration is the process of combining data residing at different sources and providing the user with a unified view of these data. The need arises in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories).

Data integration is needed more and more often as the volume of existing data, and the need to share it, grows rapidly.

Simple schematic for a data warehouse. The information from the source databases is extracted, transformed, and then loaded into the data warehouse.
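The extract, transform, load sequence in the schematic can be sketched in a few lines. This is a minimal illustration, not part of the project: the two source lists, the `protein` table, and its column names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source records from two separate databases (illustrative only).
SOURCE_A = [{"id": 1, "name": " Insulin "}, {"id": 2, "name": "keratin"}]
SOURCE_B = [{"id": 7, "name": "HEMOGLOBIN"}]


def extract():
    """Extract: pull raw rows from every source, tagging each with its origin."""
    for origin, rows in (("A", SOURCE_A), ("B", SOURCE_B)):
        for row in rows:
            yield origin, row


def transform(origin, row):
    """Transform: normalize names so rows from different sources are comparable."""
    return (f"{origin}-{row['id']}", row["name"].strip().lower())


def load(conn, records):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protein (record_key TEXT PRIMARY KEY, name TEXT)"
    )
    conn.executemany("INSERT INTO protein VALUES (?, ?)", records)
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    load(conn, [transform(origin, row) for origin, row in extract()])
    return conn


warehouse = run_etl()
names = [r[0] for r in warehouse.execute("SELECT name FROM protein ORDER BY name")]
```

After the run, `names` holds the normalized protein names from both sources in one table, which is the "unified view" the warehouse approach provides.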

DATA INTEGRATION: Difficulties of Data Integration

Huge web database

Database content is now dynamic

Necessity of efficient data crawler

Accurate and perfect Query Interfaces

Time efficiency

Depth

Volume of data handling

Importance

Integration from web databases.

Data integration is essential for gathering the necessary information from different sources.

It makes large-scale integration possible.

It gives efficient and accurate query answers.

Consider a user who is moving to a new town. To start with, different queries need different sources to answer: Where can she look for real estate listings? (e.g., realtor.com.) Researching a new car? (cars.com.) Looking for a job? (monster.com.) Further, different sources support different query capabilities: after source hunting, the user must then learn the gruelling details of querying each source.

ARCHITECTURE

There are different approaches and paradigms for data integration. Some of them are:

• Materialized: a physical, integrated repository is created.
• Data Warehouses: physical repositories of selected data extracted from a collection of DBs and other information sources.
• Mediated: data stay at the sources, and a virtual integration system is created.
• Federated and cooperative: DBMSs are coordinated to collaborate.
• Exchange: data is exported from one system to another.
• Peer-to-Peer data exchange: many peers exchange data without a central control mechanism. Data is passed from peer to peer upon request, as query answers.

Two basic goals of MetaQuerier

First, to make the deep Web systematically accessible, it helps users find online databases useful for their queries.

Second, to make the deep Web uniformly usable, it helps users query online databases.




ARCHITECTURE: The MetaQuerier architecture rests on two basic requirements

Dynamic Discovery: As sources keep changing, they must be dynamically discovered for integration. There are no preselected sources.

On-the-Fly Integration: As queries are ad hoc, MetaQuerier must mediate them on the fly for the relevant sources. There are no preconfigured sources.




ARCHITECTURE: PARTS OF METAQUERIER SYSTEM

Front end: Source Selection, Query Translation, Result Compilation

Back end: Database Crawler, Interface Extraction, Source Clustering, Schema Matching

Deep Web Repository


ARCHITECTURE: PROCESSES OF THE METAQUERIER SYSTEM ARCHITECTURE

DATA CRAWLER

INTERFACE EXTRACTION

SOURCE CLUSTERING

SCHEMA MATCHING

RESULT COMPILATION

QUERY TRANSLATION

SOURCE SELECTION

PROCESSES OF THE METAQUERIER ARCHITECTURE: Data Crawler

A process that gathers specific information from web databases and other resources.

Similar to web crawling, the process search engines use to find content on the Internet.

The Data Crawler filters and categorizes the data it finds to keep the system efficient.

It has two segments:

Site Crawler

Shallow Crawler



Workflow:

o The Site Crawler needs an efficient query interface.
o It takes user-queryable keywords from the interface and filters the query.
o The Site Crawler goes through the root page.
o It identifies IP addresses.
o The Shallow Crawler follows the IP addresses that were found.
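The shallow-crawling step can be sketched as a depth-limited, breadth-first walk from a site's root page that records every page exposing a query form. This is only an illustration under stated assumptions: `TOY_WEB` is a hypothetical in-memory site rather than a real web server, and the depth limit of 2 is an arbitrary choice, not a value given in the slides.

```python
from collections import deque

# Hypothetical in-memory "site": page -> (links, has_query_form). Illustrative only.
TOY_WEB = {
    "/": (["/about", "/search"], False),
    "/about": (["/team"], False),
    "/search": (["/search/advanced"], True),
    "/team": (["/team/history"], False),
    "/search/advanced": ([], True),
    "/team/history": (["/hidden-form"], False),
    "/hidden-form": ([], True),
}


def shallow_crawl(web, root="/", max_depth=2):
    """Breadth-first crawl from the root page, stopping at max_depth.

    Returns the pages that expose a query form, in discovery order.
    """
    found, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        links, has_form = web[page]
        if has_form:
            found.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found


interfaces = shallow_crawl(TOY_WEB)
```

With these toy pages the crawl finds `/search` and `/search/advanced` but never reaches `/hidden-form`, which sits four links from the root. That is the trade-off of a shallow crawl: it stays cheap by assuming query interfaces live close to the root page.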



Advantages

Two major challenges can be addressed through data crawling:

Dynamic discovery

Deep Web searching

Dynamic discovery is covered by the Site Crawler.

Deep Web searching is covered by the Shallow Crawler.


ARCHITECTURE: Interface Extraction

Interface Extraction (IE) extracts the required data from the query interfaces.

Query interfaces usually share similar query patterns, but they sometimes differ.

Different query patterns arise due to hidden information or attributes.

These attributes are not visible on the interface.

Workflow:

The Data Crawler hands over a huge amount of unsorted and hidden data.

IE generates a query which extracts the found data.


ARCHITECTURE: Interface Extraction (Cont.)

Key Features

It takes query interfaces in HTML format.

Then it functions as a visual language parser.

Interface Extraction tokenizes the page, parses the tokens, and then merges the resulting parse trees, of which there may be several.

Finally it generates the query capabilities.

The basic idea of interface extraction is to extract query capabilities from query interfaces.
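To make the idea of "query capabilities" concrete, the sketch below reads an HTML form and lists each control with its name, element type, and allowed values. It is a deliberate simplification: it only reads the HTML tags with Python's standard `html.parser` and does not perform the visual parsing or tree merging described above. The class name `FormCapabilityParser` and the sample form are hypothetical.

```python
from html.parser import HTMLParser


class FormCapabilityParser(HTMLParser):
    """Collect (name, element type, allowed values) for each form control."""

    def __init__(self):
        super().__init__()
        self.capabilities = []   # one dict per input/select element
        self._select = None      # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            self.capabilities.append(
                {"name": attrs.get("name"), "type": attrs.get("type", "text"), "values": []}
            )
        elif tag == "select":
            self._select = {"name": attrs.get("name"), "type": "select", "values": []}
        elif tag == "option" and self._select is not None:
            self._select["values"].append(attrs.get("value"))

    def handle_endtag(self, tag):
        if tag == "select" and self._select is not None:
            self.capabilities.append(self._select)
            self._select = None


# Hypothetical query interface for a protein database.
SAMPLE_FORM = """
<form action="/search">
  <input type="text" name="name">
  <select name="category">
    <option value="membrane">Membrane</option>
    <option value="enzyme">Enzyme</option>
  </select>
</form>
"""

parser = FormCapabilityParser()
parser.feed(SAMPLE_FORM)
capabilities = parser.capabilities
```

Here `capabilities` describes two attributes: a free-text `name` and a `category` restricted to the listed options. Hidden semantics such as units or range roles are not visible in the tags, which is why the real extractor needs more than tag reading.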


ARCHITECTURE: Source Selection

Having defined a common mediated schema for all data sources, we need to match and map the data sources to that mediated schema.

The target user may understand the concepts in their own domain but may not know those of other domains.

The solution is to decide which sources to include in the data integration and which mediated schema to use.

All ontologies are stored in a common repository.

The system identifies which ontology to use based on the query the user submits.


ARCHITECTURE: Result Compilation

The last process of the data integration.

It aggregates query results for the user.

It compiles the results from different sources into coherent pieces.

It will also be used to extract data from schema matching and to match other attributes across different sources.

Source Clustering

• Collaborates with source selection, which works in the front end.
• Clusters sources according to subject domain (e.g. edu, org, etc.).
• Sorts data as an intermediate process that feeds the schema matching process.
• The main task is to construct a hierarchy of clusters from a given set of query capabilities.
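One way to picture building a hierarchy of clusters from query capabilities is a small agglomerative procedure: start with one cluster per source and repeatedly merge the two clusters whose attribute sets overlap most. The sketch below is an assumption-laden illustration, not the project's clustering method: the sources, their attribute sets, the Jaccard similarity measure, and the 0.3 threshold are all hypothetical choices.

```python
from itertools import combinations

# Hypothetical sources and the attributes their query interfaces expose.
SOURCES = {
    "protein_db_1": {"name", "category", "concentration"},
    "protein_db_2": {"name", "source", "concentration"},
    "car_site": {"make", "model", "price"},
    "car_portal": {"make", "model", "year"},
}


def jaccard(a, b):
    """Share of attributes that two interfaces have in common."""
    return len(a & b) / len(a | b)


def cluster_sources(sources, threshold=0.3):
    """Greedy agglomerative clustering; returns the final clusters and merge history."""
    clusters = {frozenset([name]): set(attrs) for name, attrs in sources.items()}
    history = []  # each merge is one level of the hierarchy
    while len(clusters) > 1:
        names = sorted(clusters, key=sorted)  # fixed order keeps the result deterministic
        c1, c2 = max(
            combinations(names, 2),
            key=lambda p: jaccard(clusters[p[0]], clusters[p[1]]),
        )
        score = jaccard(clusters[c1], clusters[c2])
        if score < threshold:
            break  # remaining clusters are too different to merge
        history.append((sorted(c1), sorted(c2), round(score, 2)))
        clusters[c1 | c2] = clusters.pop(c1) | clusters.pop(c2)
    return sorted(sorted(c) for c in clusters), history


final_clusters, merge_history = cluster_sources(SOURCES)
```

On this toy input the two car sites end up in one cluster and the two protein databases in another, and `merge_history` records the order of the merges.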



Source Clustering (Cont.)

CHARACTERISTICS OF DOMAIN ELEMENTS AND CONSTRAINT ELEMENTS:

• Textboxes cannot be used as constraint elements.
• Radio buttons, checkboxes, or selection lists may appear as constraint elements.
• An attribute consisting of a single element cannot have constraint elements.
• An attribute consisting of only radio buttons or checkboxes does not have constraint elements.



Source Clustering (Cont.)

HOW TO DIFFERENTIATE BETWEEN DOMAIN & CONSTRAINT ELEMENTS: A simple two-step method can be used:

1. First, identify the attributes that contain only one element, or whose elements are all radio buttons, all checkboxes, or all textboxes. By the characteristics of domain and constraint elements, these attributes contain only domain elements.

2. Second, an Element Classifier is needed to process the other attributes, which may contain both domain elements and constraint elements. Each element is represented by four features: element name, element format type, element relative position in the element list, and element values.
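The first step of this method is a simple rule check, sketched below. The function names and the type labels (`"radio"`, `"checkbox"`, `"textbox"`, `"select"`) are hypothetical, and the sketch assumes that "all radio buttons, or checkboxes or textboxes" means all elements share one of those three types. The Element Classifier itself is not implemented; only the four-feature record that it would receive is shown.

```python
SIMPLE_TYPES = {"radio", "checkbox", "textbox"}


def needs_element_classifier(element_types):
    """Step 1: return False when the attribute cannot hold constraint elements.

    element_types lists the format type of each element of one attribute.
    """
    if len(element_types) == 1:
        return False  # a single element is always a domain element
    if len(set(element_types)) == 1 and element_types[0] in SIMPLE_TYPES:
        return False  # all radio buttons, all checkboxes, or all textboxes
    return True       # mixed attribute: hand over to the Element Classifier (step 2)


def element_features(name, format_type, position, values):
    """Step 2 input: the four features that describe one element."""
    return {
        "name": name,
        "format_type": format_type,
        "position": position,
        "values": values,
    }


# A textbox followed by two radio buttons must go through step 2.
mixed = needs_element_classifier(["textbox", "radio", "radio"])
```

For example, two textboxes forming a from/to pair are settled in step 1, while a textbox accompanied by radio buttons is passed on, because the radio buttons may be constraint elements.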



Source Clustering (Cont.)

DERIVING INFORMATION FROM ATTRIBUTES:

Four types of information are defined for each attribute (only for domain elements):

1. Domain type: Indicates how many distinct values can be used for an attribute in queries. Four domain types are defined in our model: range, finite, infinite, and Boolean.

2. Value type: Each attribute on a search interface has its own semantic value type, even though all input values are treated as text values.



Source Clustering (Cont.)

3. Default value: Indicates some semantics of the attribute. It may occur in a selection list, a group of radio buttons, or a group of checkboxes, and is always marked as "checked" or "selected".

4. Unit: Defines the meaning of an attribute value, e.g., kilogram is a unit for weight.
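The four pieces of information can be held in one small record per attribute. The sketch below is illustrative only: the `AttributeInfo` class and the inference rules are assumptions (two textboxes are read as a range, a lone checkbox as Boolean, free-text boxes as infinite, and everything else as finite), and the value type, default value, and unit are simply passed in rather than derived.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributeInfo:
    domain_type: str               # "range", "finite", "infinite" or "boolean"
    value_type: str                # e.g. "integer", "date", "text"
    default_value: Optional[str]   # the value marked "checked" or "selected", if any
    unit: Optional[str]            # e.g. "kilogram" for a weight attribute


def infer_domain_type(element_types):
    """Guess the domain type from the format types of the domain elements."""
    if element_types == ["checkbox"]:
        return "boolean"     # a single on/off choice
    if element_types == ["textbox", "textbox"]:
        return "range"       # a from/to pair of textboxes
    if all(t == "textbox" for t in element_types):
        return "infinite"    # free text: unlimited possible values
    return "finite"          # selection list, radio buttons or checkboxes


def describe(element_types, value_type="text", default_value=None, unit=None):
    """Bundle the four types of information for one attribute."""
    return AttributeInfo(infer_domain_type(element_types), value_type, default_value, unit)


# Hypothetical attributes of a search interface.
year_range = describe(["textbox", "textbox"], value_type="integer")
weight = describe(["select"], value_type="integer", default_value="any", unit="kilogram")
```

Keeping the record explicit makes later steps easier: two attributes from different sources can be compared field by field instead of by label alone.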



Schema Matching

• A schema defines the tables, the fields in each table, and the relationships between fields and tables.
• It can be shown as a graphical representation of the database structure.
• Schema matching is the process of identifying whether two objects are semantically related, while mapping refers to the transformations between the objects.
• In data integration, schema matching finds the semantic correspondences among the attributes that have been found through query interfaces.
• It uses data from the query capabilities and organizes the data as required.
• It provides data to Source Selection and Query Translation, and the data finally reaches users at the front end.
• MetaQuerier redesigns the process as complex matching instead of a one-by-one process.






Query Translation

• A front-end process.
• Translation is necessary to match query conditions and express them in terms of what an interface accepts.
• It is critical to interpret queries automatically.

Steps for complete query translation:

Step 1: Extract constraint templates from a query interface.

Step 2: Find matching templates among the given source and target constraint templates.



Query Translation (Cont.)

Constraint mapping:

• The objective is to find the target constraint with the closest semantic meaning to the source constraint.

Query mediation:

• Mediates queries across multiple sources.
• Abstracts the problem as an instance of answering queries using views.
• The focus is to decompose a user query into sub-queries across multiple sources.

Schema mapping:

• Translates a set of data values from one source to another, according to a given matching.
• Is concerned only with the equality relation between different schemas.



BIOMEDICAL DATA - PROTEIN

WHAT IS PROTEIN:

Any of a large group of nitrogenous organic compounds that are essential constituents of living cells; consist of polymers of amino acids; essential in the diet of animals for growth and for repair of tissues; can be obtained from meat, eggs, milk, and legumes.

TYPE OF

MACROMOLECULE, SUPERMOLECULE

PART OF

AMINO ACID, AMINOALKANOIC ACID, POLYPEPTIDE

BIOMEDICAL DATA - PROTEIN

SOME EXAMPLES OF PROTEIN INFORMATION

BIOMEDICAL DATA - PROTEIN

AVAILABLE WEB SERVICES ABOUT PROTEIN

Source Clustering (Example)

DERIVING INFORMATION FROM ATTRIBUTES:

1. Domain type: range, finite, infinite, and Boolean

Here, two textboxes are used to represent a range for the attribute Production Year, so the attribute has the range domain type.

2. Value type: distinct values

For example, the attribute Onlooker's age (or Reader age) semantically has integer values, and Production date has date values.

3. Default value:

In the previous figure, the attribute Onlooker's age has the default value "all ages".

4. Unit:

One search interface may use "Milligrams/grams" as the unit of its Concentration attribute, while another may use "Litres" for its Concentration attribute.

Query Translation (Example)

Two biomedical data query interfaces and their matching

• Name of biomedical data: Proteins
• Constraint templates: look at the interfaces

Q1 (source): S1: name; S2: category; S3: concentration; [between; $low, $high]; S4: onlooker's age; [in; {[18:65],…}]

Q2 (target): T1: name; T2: source; T3: onlooker's age; T4: concentration

The focus is to translate between "matching" constraint templates; for example, S2 in Q1 matches T2 in Q2.

We need to extract the constraint templates (T1,…,T4).

Given source and target constraint templates (Q1 and Q2 respectively), we need to find the matching templates.


Constraint mapping across Query Interfaces (Q1 and Q2)

• Constraint mapping instantiates T2 into t2 = [source; all words; "Membrane Protein"].
• This is the best translation of the source constraint s2, i.e., s2 → t2.


Translation rules T12 between Q1 and Q2

To translate queries we need the following mapping rules:

r1: [category; contain; $s] => emit: [source; all; $s]
r2: [name; contain; $t] => emit: [name; contain; $t]
r3: [concentration range; between; $s, $t] => $p = ChooseClosestNum($s), emit: [concentration; less than; $p]
r4: [onlooker's age; between; $s] => $r = ChooseClosestRange($s), emit: [age; between; $r]

Table: Translation Rules
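The four rules can be read as a small rewrite function from source constraints to target constraints. The sketch below is a hedged illustration: the slides name ChooseClosestNum and ChooseClosestRange but do not define them, so both helper bodies, the supported numbers, and the supported age ranges are assumptions. In particular, r3 here feeds the upper bound of the range to the helper so that the emitted "less than" constraint covers the whole range, which deviates from the `$s` written in the table.

```python
def choose_closest_num(value, supported=(10, 25, 50, 100)):
    """Assumed behavior: pick the supported number nearest to the given value."""
    return min(supported, key=lambda n: abs(n - value))


def choose_closest_range(age_range, supported=((0, 17), (18, 65), (66, 120))):
    """Assumed behavior: pick the supported range that overlaps the request most."""
    low, high = age_range

    def overlap(r):
        return min(high, r[1]) - max(low, r[0])

    return max(supported, key=overlap)


def translate(constraint):
    """Apply rules r1-r4 to one source constraint (attribute, operator, value)."""
    attribute, operator, value = constraint
    if attribute == "category" and operator == "contain":              # r1
        return ("source", "all", value)
    if attribute == "name" and operator == "contain":                  # r2
        return ("name", "contain", value)
    if attribute == "concentration range" and operator == "between":   # r3
        _, high = value  # assumption: use the upper bound for "less than"
        return ("concentration", "less than", choose_closest_num(high))
    if attribute == "onlooker's age" and operator == "between":        # r4
        return ("age", "between", choose_closest_range(value))
    raise ValueError(f"no translation rule for {constraint!r}")


source_query = [
    ("category", "contain", "Membrane Protein"),
    ("concentration range", "between", (5, 20)),
    ("onlooker's age", "between", (20, 60)),
]
target_query = [translate(c) for c in source_query]
```

Rules r1 and r2 only rename attributes or operators, while r3 and r4 have to approximate: the target interface cannot express the exact range, so the closest supported value is emitted instead.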


Text-type constraints use the operators any, all, exact, and start, with string values. Numeric-type constraints use the operators equal, greater than, less than, and between, with numeric values.

The constraint mapping framework

• Input: a source constraint s and a target constraint template T.
• Output: the target constraint t_opt, among those that T can generate, that is closest to s.
• The type recognizer takes the source constraint s and the target constraint template T as input, infers the data type by analyzing the constraints syntactically, and then dispatches them to the matching type handler.
• The type handler takes the constraints dispatched by the type recognizer as input and searches among the possible instantiations of T for the best one, which is then returned as the mapping.


Mapping the constraints between category in Q1 and source in Q2:

• Source constraint s = [category; contain; "Membrane Protein"] is instantiated from template S = [category; contain; $val] by populating $val = "Membrane Protein".
• Target constraint template T = [source; $op; $val] accepts operators $op from {"any words", "all words"} and value $val from any string.

t1 = [source; any; "Membrane Protein"]
t2 = [source; all; "Membrane Protein"]
t3 = [source; any; "Membrane"]
t4 = [source; any; "Protein"]

• Among the candidate target constraints t1, t2, . . . from I(T), the set of instantiations of T, constraint mapping searches for the one that is closest to the source constraint s.
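The search over I(T) can be illustrated by enumerating candidate instantiations and scoring each by how closely the records it selects match the records the source constraint selects. Everything in the sketch is an assumption made for illustration: the sample strings, the word-based semantics of "any" and "all", and the use of the symmetric difference of result sets as the closeness measure.

```python
from itertools import product

# Hypothetical sample values used to compare what each constraint selects.
SAMPLE = [
    "Membrane Protein",
    "Protein Membrane transport",
    "Membrane lipid",
    "Globular Protein",
]


def matches(operator, value, text):
    """Evaluate one text constraint against one sample string."""
    words, text_words = value.lower().split(), text.lower().split()
    if operator == "contain":            # the phrase appears as written
        return value.lower() in text.lower()
    if operator == "any":                # at least one word appears
        return any(w in text_words for w in words)
    if operator == "all":                # every word appears, in any order
        return all(w in text_words for w in words)
    raise ValueError(operator)


def selected(operator, value):
    """The sample records that a constraint would return."""
    return {t for t in SAMPLE if matches(operator, value, t)}


def closest_target(source, operators=("any", "all")):
    """Search I(T) for the instantiation whose result set is closest to the source's."""
    _, s_op, s_val = source
    wanted = selected(s_op, s_val)
    values = [s_val] + s_val.split()     # candidate $val: the phrase and each word
    candidates = [("source", op, val) for op, val in product(operators, values)]
    # Fewer records gained or lost relative to the source means a closer mapping.
    return min(candidates, key=lambda t: len(wanted ^ selected(t[1], t[2])))


best = closest_target(("category", "contain", "Membrane Protein"))
```

On this sample the search returns the "all" constraint over the full phrase, which agrees with t2 being the best translation of s2. The target cannot express an exact phrase match, so "all" is the nearest it can get: it admits one extra record, whereas every "any" candidate admits more.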

EXAMPLE OF AN INTERFACE

"Invention date" implies that the attribute is semantically a date data type.

Two elements are used to specify a range query condition, with different roles in specifying the condition.

Such semantic information is hidden from computers.

It is not defined on query interfaces.

This HIDDEN information about each attribute needs to be revealed and defined to enrich the schema matching.


OVERVIEW OF THE SYSTEM

SOURCES ARE NOT PREDEFINED OR PRECONFIGURED, SO THEY MUST BE FOUND DYNAMICALLY ACCORDING TO THE USER'S AD HOC INFORMATION NEEDS.

AFTER THE WEB DATABASES ARE DISCOVERED, THEIR QUERY CAPABILITIES MUST BE EXTRACTED; THIS IS ALSO AUTOMATIC AND ON THE FLY.

THEN, WHEN QUERYING THE SOURCES, THE QUERY IS TRANSLATED ON THE FLY, SINCE THE SOURCES ARE UNSEEN.

OVERVIEW OF THE SYSTEM

WORK FLOW OF THE SYSTEM

BACK END: SEMANTIC DISCOVERY
- DATA CRAWLER: automatically collects sources from the deep Web
- INTERFACE EXTRACTION: extracts query capabilities from interfaces
- SOURCE CLUSTERING: clusters interfaces into subdomains
- SCHEMA MATCHING: discovers semantic matchings

FRONT END: EXECUTION OF QUERY
- PROVIDE THE USER A DOMAIN CATEGORY
- FOR EACH CATEGORY A UNIFIED INTERFACE IS GENERATED BY SCHEMA MATCHING (SM)
- SELECT APPROPRIATE SOURCES TO RUN THE QUERY (SOURCE SELECTION, SS)
- THE QUERY IS TRANSLATED FOR THE SELECTED SOURCES BY QUERY TRANSLATION
- FINALLY THE RESULTS ARE AGGREGATED BY RESULT COMPILATION

CONCLUSION

Our target is to deploy MetaQuerier as an efficient data integration architecture.

The implementation can be carried out successfully on biomedical data.

Within the subsystems of MetaQuerier, some conceptual changes can be made to improve the efficiency of handling huge volumes of unsorted data.

THANK YOU

ANY QUESTIONS?
