BIOMEDICAL DATA INTEGRATION BASED ON METAQUERIER ARCHITECTURE
GROUP MEMBERS:
- NAIEEM KHAN
- EUSUF ABDULLAH MIM
- M SAMIULLAH CHOWDHURY
ADVISOR: KHONDKER SHAJADUL HASAN
CO-ADVISOR: JAVED SIDDIQUE
Three basic parts of the project
DATA INTEGRATION
METAQUERIER ARCHITECTURE
BIOMEDICAL DATA
DATA INTEGRATION
What does it mean?
Data integration is the process of combining data residing at different sources and providing the user with a unified view of these data. The need arises in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories).
Data integration is needed with increasing frequency as the volume of existing data, and the need to share it, explodes.
Simple schematic for a data warehouse. The information from the source databases is extracted, transformed, then loaded into the data warehouse.
Difficulties of Data Integration
Huge web databases
Database content is now dynamic
Necessity of efficient data crawler
Accurate and perfect Query Interfaces
Time efficiency
Depth
Volume of data handling
Importance
Integration from web databases.
Data integration is essential for getting the necessary information from different sources.
It makes large-scale integration possible.
It gives efficient and accurate query answers.
Consider a user who is moving to a new town. To start with, different queries
need different sources to answer: Where can she look for real estate
listings? (e.g., realtor.com.) Researching a new car? (cars.com.)
Looking for a job? (monster.com.) Further, different sources support different
query capabilities: after source hunting, the user must then learn the
gruelling details of querying each source.
METAQUERIER ARCHITECTURE
There are different approaches and paradigms for data integration. Some of them are:
• Materialized: a physical, integrated repository is created.
• Data Warehouses: physical repositories of selected data extracted from a
collection of DBs and other information sources.
• Mediated: data stay at the sources, a virtual integration system is created.
• Federated and cooperative: DBMSs are coordinated to collaborate.
• Exchange: Data is exported from one system to another.
• Peer-to-Peer data exchange: Many peers exchange data without a central
control mechanism. Data is passed from peer to peer upon request,
as query answers.
Two basic goals of MetaQuerier
First, to make the deep Web systematically accessible: it will help users find online databases useful for their queries.
Second, to make the deep Web uniformly usable: it will help users query online databases.
The MetaQuerier architecture rests on two basic principles
Dynamic Discovery
Because sources are changing, they must be discovered dynamically for integration. There are no preselected sources.
On-the-Fly Integration
Because queries are ad hoc, MetaQuerier must mediate them on the fly for the relevant sources. There are no preconfigured sources.
PARTS OF THE METAQUERIER SYSTEM
Front end: Source Selection, Query Translation, Result Compilation
Back end: Database Crawler, Interface Extraction, Source Clustering, Schema Matching
Deep Web Repository
PROCESSES OF THE METAQUERIER SYSTEM ARCHITECTURE
DATA CRAWLER
INTERFACE EXTRACTION
SOURCE CLUSTERING
SCHEMA MATCHING
RESULT COMPILATION
QUERY TRANSLATION
SOURCE SELECTION
Data Crawler
A process that gathers specific information from web databases and other resources.
Similar to web crawling, the process search engines use to traverse the Internet.
The Data Crawler searches data by filtering and categorizing it, which makes the system efficient.
It has two different segments:
• Site Crawler
• Shallow Crawler
Workflow:
• The Site Crawler needs an efficient query interface.
• It takes user-queryable keywords from the interface and filters the query.
• The Site Crawler goes through the root page.
• It identifies IP addresses.
• The Shallow Crawler follows the IP addresses that were found.
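The shallow-crawling step of this workflow can be sketched as a depth-limited, breadth-first walk from a site's root page. This is only an illustration under stated assumptions, not the project's crawler: the in-memory `SITE` dictionary stands in for real web pages, and the names `SITE`, `shallow_crawl` and `max_depth` are hypothetical.

```python
from collections import deque

# Hypothetical stand-in for one web site: page -> (has_query_form, outgoing links).
SITE = {
    "/": (False, ["/about", "/search"]),
    "/about": (False, ["/"]),
    "/search": (True, ["/search/advanced"]),
    "/search/advanced": (True, ["/deep/page"]),
    "/deep/page": (True, []),
}


def shallow_crawl(site, root="/", max_depth=2):
    """Breadth-first crawl from the root page, stopping at max_depth.

    Returns the pages that contain a query form, in the order found.
    """
    found = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        has_form, links = site[page]
        if has_form:
            found.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links:
            if link in site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found


interfaces = shallow_crawl(SITE)
```

With the default limit of two levels the crawl reports `/search` and `/search/advanced` and never reaches `/deep/page`. Keeping the crawl shallow is the design choice being illustrated.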
Advantages
Two major challenges can be addressed through data crawling:
• Dynamic discovery
• Deep web searching
Dynamic discovery is covered by the Site Crawler.
Deep web searching is covered by the Shallow Crawler.
Interface Extraction
Interface Extraction extracts the required data from the query interfaces.
Query interfaces usually share similar query patterns, but the patterns sometimes differ.
Different query patterns arise from hidden information or attributes.
These attributes are not visible on the interface.
Workflow:
The Data Crawler hands over a huge amount of unsorted and hidden data.
IE generates a query that extracts the found data.
Key Features
It takes query interfaces in HTML format.
Then it functions as a visual language parser.
Interface Extraction tokenizes the page, parses the tokens and
then merges potentially multiple trees.
Finally it generates the query capabilities.
The basic idea of interface extraction is to extract query
capabilities from query interfaces.
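As a rough illustration of reading query capabilities from an HTML query interface, the sketch below lists the form elements of a page using Python's standard `html.parser` module. It is a deliberate simplification: it produces a flat list of elements and is not the visual language parser described above, which tokenizes the page, parses the tokens and merges trees. The form `FORM` and the names `FormFieldParser` and `extract_fields` are hypothetical.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects (name, element type) pairs for the input elements of a form."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            # An <input> without a type attribute defaults to a textbox.
            self.fields.append((attrs.get("name"), attrs.get("type", "text")))
        elif tag == "select":
            self.fields.append((attrs.get("name"), "select"))


def extract_fields(html):
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical query interface for a protein database.
FORM = """
<form action="/search">
  Name <input name="name">
  Category <select name="category"><option>Membrane Protein</option></select>
  Concentration <input type="text" name="low"> to <input type="text" name="high">
</form>
"""

fields = extract_fields(FORM)
```

The resulting list keeps the elements in page order, so the two adjacent textboxes `low` and `high` can later be recognised as one range attribute.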
Source Selection
Having defined a common mediated schema for all data sources, we
need to match and map the data sources to that mediated
schema.
The target user may understand the concepts in their own
domain but may not know those of other domains.
The solution is to determine which sources to include in the data integration
and which mediated schema to use.
All ontologies are stored in a common repository.
The system identifies which ontology will be used based on the
query submitted by the user.
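One simple way to picture the last step is keyword overlap between the query and each stored ontology. The sketch below is an assumption made for illustration only: the slides do not say how the ontology is chosen, and `ONTOLOGY_REPOSITORY`, its contents and `select_ontology` are hypothetical.

```python
# Hypothetical repository: ontology name -> terms it covers.
ONTOLOGY_REPOSITORY = {
    "protein": {"protein", "amino", "acid", "membrane", "concentration"},
    "real_estate": {"house", "listing", "rent", "price"},
    "jobs": {"job", "salary", "employer"},
}


def select_ontology(query, repository):
    """Return the ontology sharing the most terms with the query, or None."""
    words = set(query.lower().split())
    best_name, best_overlap = None, 0
    for name, terms in repository.items():
        overlap = len(words & terms)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


chosen = select_ontology("membrane protein concentration", ONTOLOGY_REPOSITORY)
```

A query that shares no term with any ontology returns `None`, which a real system would have to handle, for example by asking the user to pick a domain.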
Result Compilation
The last process of the data integration.
It aggregates the query results and returns them to the user.
It compiles the results from different sources into coherent
pieces.
It draws on the schema matching results to match
attributes across different sources.
Source Clustering
• Collaborates with source selection which works in the front-end.
• Clusters sources according to subject domain (e.g. edu, org etc).
• Sorts data as an intermediate step and passes it on to schema matching.
• Main task is to construct a hierarchy of clusters with a given set of query capabilities.
Source Clustering (Cont.)
CHARACTERISTICS OF DOMAIN ELEMENTS AND CONSTRAINT ELEMENTS:
• Textboxes cannot be used for constraint elements.
• Radio buttons or checkboxes or selection lists may appear as
constraint elements.
• An attribute consisting of a single element cannot have constraint
elements.
• An attribute consisting of only radio buttons or checkboxes does
not have constraint elements.
Source Clustering (Cont.)
HOW TO DIFFERENTIATE BETWEEN DOMAIN & CONSTRAINT ELEMENTS:
A simple two-step method can be used:
1. First, identify the attributes that contain only one element or whose elements are all radio buttons, checkboxes or textboxes. By the characteristics listed earlier, such attributes have no constraint elements.
2. Second, an Element Classifier is needed to process the other attributes, which may contain both domain elements and constraint elements. Each element is represented by four features: element name, element format type, element relative position in the element list, and element values.
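The two steps can be sketched as follows. This is a minimal sketch, not the project's code: the function names are hypothetical, the classifier itself is left out, and step 1 is read as "every element is a radio button, a checkbox or a textbox", which is one possible reading of the rule.

```python
SIMPLE_TYPES = {"radio", "checkbox", "textbox"}


def needs_element_classifier(element_types):
    """Step 1: decide whether an attribute can be settled without a classifier.

    Returns False when the attribute has a single element, or when all of its
    elements are radio buttons, checkboxes or textboxes. Returns True when the
    attribute must go on to the Element Classifier of step 2.
    """
    if len(element_types) == 1:
        return False
    return not all(t in SIMPLE_TYPES for t in element_types)


def feature_vector(name, format_type, position, values):
    """Step 2 input: the four features that represent one element."""
    return {
        "name": name,
        "format_type": format_type,
        "position": position,
        "values": tuple(values),
    }


# A textbox followed by a selection list may mix domain and constraint elements.
mixed = needs_element_classifier(["textbox", "select"])
```

Only attributes for which `needs_element_classifier` returns `True` would be turned into feature vectors and passed to the classifier.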
Source Clustering (Cont.)
DERIVING INFORMATION FROM ATTRIBUTES:
Four types of information are defined for each attribute (only for domain elements):
1. Domain type: Indicates how many distinct values can be used for an attribute in queries. Four domain types are defined in our model: range, finite, infinite and Boolean.
2. Value type: Each attribute on a search interface has its own semantic value type, even though all input values are treated as text values.
3. Default Value: Indicates some semantics of the attribute. It may occur in a selection list, a group of radio buttons or a group of checkboxes, and is always marked as “checked” or “selected”.
4. Unit: Defines the meaning of an attribute value, e.g., kilogram is a unit for weight.
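To make the domain type (item 1) concrete, the sketch below guesses it from the kinds of elements an attribute uses. Only the rule that two textboxes form a range comes from this presentation; the other three rules, and the name `infer_domain_type`, are assumptions made for illustration.

```python
def infer_domain_type(element_types):
    """Guess the domain type of an attribute from its domain elements.

    Only the two-textbox rule comes from the slides; the rest are assumptions.
    """
    if element_types == ["textbox", "textbox"]:
        return "range"      # two textboxes give a lower and an upper bound
    if element_types == ["checkbox"]:
        return "Boolean"    # assumption: a lone checkbox is an on/off choice
    if element_types == ["textbox"]:
        return "infinite"   # assumption: free text accepts any value
    return "finite"         # assumption: lists and button groups enumerate values


production_year = infer_domain_type(["textbox", "textbox"])
```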
Schema Matching
• A schema defines the tables, the fields in each table, and the
relationships between fields and tables.
• It can be shown as a graphical representation of a database structure.
• Schema matching is the process of identifying whether two objects
are semantically related, while mapping refers to the
transformations between the objects.
• In data integration, schema matching finds the semantic correspondences
among the attributes that have been found through query
interfaces.
• Uses data from the query capabilities and organizes the data as
required.
• It provides data to Source Selection and Query Translation, which
finally send the data to users at the front end.
• MetaQuerier redesigns the process in terms of complex matching
instead of matching attributes one by one.
Query Translation
• A front-end process.
• Translation is necessary to match and express query conditions in
terms of what an interface supports.
• It is critical to interpret queries automatically.
Steps for complete query translation:
Step 1: Extract constraint templates from a query interface.
Step 2: Find matching templates from the given source and target constraint templates.
Constraint mapping:
• The objective is to find the target constraint with the closest semantic meaning
to the source constraint.
Query mediation:
• Mediating queries across multiple sources.
• Abstracts the problem as one of answering queries using views.
• Focus is to decompose a user query into sub-queries across multiple sources.
Schema mapping:
• Translates a set of data values from one source to another one, according to
given matching.
• Is concerned only with the equality relation between different schemas.
BIOMEDICAL DATA - PROTEIN
WHAT IS PROTEIN:
Any of a large group of nitrogenous organic compounds that are essential constituents of living cells; consist of polymers of amino acids; essential in the diet of animals for growth and for repair of tissues; can be obtained from meat, eggs, milk and legumes.
TYPE OF: MACROMOLECULE, SUPERMOLECULE
PART OF: AMINO ACID, AMINOALKANOIC ACID, POLYPEPTIDE
SOME EXAMPLES OF PROTEIN INFORMATION
AVAILABLE WEB SERVICES ABOUT PROTEIN
Source Clustering (Example)
DERIVING INFORMATION FROM ATTRIBUTES:
1. Domain type: range, finite, infinite and Boolean
Here, two textboxes are used to represent a range for the attribute Production Year, thus the attribute should have range domain type.
2. Value type: Distinct Values
For example, the attribute Onlooker’s age or Reader age semantically has integer values, and Production date has date values.
3. Default Value:
In the previous figure, the attribute Onlooker’s age has a default value “all ages”.
4. Unit:
One search interface may use “Milligrams/grams” as the unit of its Concentration attribute, while another may use “Litres” for its Concentration attribute.
Query Translation (Example)
Two Bio-Medical Data query interfaces and their matching
• Name of Bio-Medical data – Proteins
• Constraint templates –
Look at the interfaces
T1: name
T2: source
T3: onlooker’s age
T4: concentration
S1: name
S2: category
S3: concentration; [between; $low, $high]
S4: onlooker’s age; [in; {[18:65],…}]
The focus is to translate between “matching” constraint templates: S2 in Q1 matches T2 in Q2.
We need to extract the constraint templates (T1,…,T4).
Given source and target constraint templates (Q1 and Q2 respectively), we need to find matching templates.
Constraint mapping across Query Interfaces (Q1 and Q2)
• Constraint mapping instantiates T2 into t2 = [source; all words; "Membrane Protein"].
• This is the best translation of the source constraint s2, i.e., s2 → t2.
Translation rules T12 between Q1 and Q2
To translate queries we need the following mapping rules:
r1: [category; contain; $s] → emit: [source; all; $s]
r2: [name; contain; $t] → emit: [name; contain; $t]
r3: [concentration range; between; $s, $t] → $p = ChooseClosestNum($s), emit: [concentration; less than; $p]
r4: [onlooker’s age; between; $s] → $r = ChooseClosestRange($s), emit: [age; between; $r]
Table: Translation Rules
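The rules in the table can be sketched as a small function. This is an illustration under assumptions rather than the actual rule engine: constraints are written as `(attribute, operator, value)` tuples, the values that the target interface accepts (`TARGET_CONCENTRATIONS`, `TARGET_AGE_RANGES`) are invented, and the distance measures inside `choose_closest_num` and `choose_closest_range` are guesses at what ChooseClosestNum and ChooseClosestRange do.

```python
# Hypothetical values that the target interface Q2 accepts.
TARGET_CONCENTRATIONS = [5, 10, 50, 100]
TARGET_AGE_RANGES = [(0, 17), (18, 65), (66, 120)]


def choose_closest_num(value, options=TARGET_CONCENTRATIONS):
    """Pick the accepted number nearest to value."""
    return min(options, key=lambda option: abs(option - value))


def choose_closest_range(wanted, options=TARGET_AGE_RANGES):
    """Pick the accepted range whose endpoints are nearest to the wanted range."""
    low, high = wanted
    return min(options, key=lambda r: abs(r[0] - low) + abs(r[1] - high))


def translate(constraint):
    """Apply rules r1 to r4 to one source constraint (attribute, operator, value)."""
    attribute, operator, value = constraint
    if attribute == "category" and operator == "contain":              # r1
        return ("source", "all", value)
    if attribute == "name" and operator == "contain":                  # r2
        return ("name", "contain", value)
    if attribute == "concentration range" and operator == "between":   # r3
        low, _high = value  # the rule as written uses only the first bound, $s
        return ("concentration", "less than", choose_closest_num(low))
    if attribute == "onlooker's age" and operator == "between":        # r4
        return ("age", "between", choose_closest_range(value))
    raise ValueError(f"no translation rule for {constraint!r}")


translated = translate(("category", "contain", "Membrane Protein"))
```

Rules r1 and r2 only rename the attribute or operator, while r3 and r4 must approximate, because the target interface accepts only a fixed set of values.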
Text type constraints: operators any, all, exact, start, with string values.
Numeric type constraints: operators equal, greater than, less than, between, with numeric values.
The constraint mapping framework
• Input: a source constraint s and a target constraint template T.
• Output: the target constraint t_opt, among those that T can generate, that is closest
to s.
• The type recognizer identifies the type of the constraints, and then
dispatches them accordingly to the type handler.
• The type handler performs the search to find a good instantiation among possible ones described by T, which is then returned as the mapping.
• The type recognizer takes the source constraint s and target constraint
template T as input, and infers the data type by analyzing the constraints
syntactically.
• The type handler takes the constraints dispatched by the type recognizer
as input and performs search among possible instantiations of the target
constraint template for the best one.
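The division of labour between the type recognizer and the type handlers can be sketched as below. This is a simplified sketch: the recognizer looks only at operator names, using the operator lists given earlier plus `contain` from the translation rules, and the function names and the `handlers` dictionary are hypothetical.

```python
# Operators per data type; "contain" is added because the translation rules use it.
TEXT_OPERATORS = {"any", "all", "exact", "start", "contain"}
NUMERIC_OPERATORS = {"equal", "greater than", "less than", "between"}


def recognize_type(source_constraint, target_operators):
    """Infer the data type syntactically from the operators involved."""
    _attribute, operator, _value = source_constraint
    operators = {operator, *target_operators}
    if operators <= NUMERIC_OPERATORS:
        return "numeric"
    if operators <= TEXT_OPERATORS:
        return "text"
    return "unknown"


def dispatch(source_constraint, target_operators, handlers):
    """Send the constraint to the handler registered for its type."""
    data_type = recognize_type(source_constraint, target_operators)
    return handlers[data_type](source_constraint, target_operators)


kind = recognize_type(("category", "contain", "Membrane Protein"), ["any", "all"])
```

Each handler would then search the instantiations of the target template that are valid for its own data type.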
Mapping the constraints between category in Q1 and source in Q2:
• Source constraint s = [category; contain; "Membrane Protein"] is instantiated from
template S = [category; contain; $val] by populating $val = "Membrane Protein".
• Target constraint template T = [source; $op; $val] accepts operators $op from
{"any words", "all words"} and value $val from any string.
t1 = [source; any; “Membrane Protein”]
t2 = [source; all; “Membrane Protein”]
t3 = [source; any; “Membrane”]
t4 = [source; any; “Protein”]
• Among the candidate target constraints t1, t2, . . . , from I(T), the constraint mapping thus searches for the element that is closest to the source.
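The search for the closest candidate can be sketched as follows. The notion of closeness used here, Jaccard overlap between the answers each constraint selects from a small sample, is an assumption made for illustration, as are the sample values in `SAMPLE` and all function names.

```python
# Hypothetical sample of values stored in the target database.
SAMPLE = [
    "Membrane Protein",
    "Protein Membrane Complex",
    "Membrane Lipid",
    "Globular Protein",
    "Enzyme",
]


def matches(constraint, text):
    """Evaluate one text constraint (attribute, operator, value) on a string."""
    _attribute, operator, value = constraint
    text_words = text.lower().split()
    query_words = value.lower().split()
    if operator == "contain":   # the value must appear as a phrase
        return value.lower() in text.lower()
    if operator == "any":       # at least one query word appears
        return any(word in text_words for word in query_words)
    if operator == "all":       # every query word appears
        return all(word in text_words for word in query_words)
    raise ValueError(f"unsupported operator: {operator}")


def result_set(constraint, sample):
    """Indexes of the sample values that satisfy the constraint."""
    return {i for i, text in enumerate(sample) if matches(constraint, text)}


def closest_constraint(source, candidates, sample=SAMPLE):
    """Return the candidate whose answers overlap most with the source's (Jaccard)."""
    wanted = result_set(source, sample)

    def similarity(candidate):
        got = result_set(candidate, sample)
        union = wanted | got
        return len(wanted & got) / len(union) if union else 1.0

    return max(candidates, key=similarity)


s = ("category", "contain", "Membrane Protein")
candidates = [
    ("source", "any", "Membrane Protein"),  # t1
    ("source", "all", "Membrane Protein"),  # t2
    ("source", "any", "Membrane"),          # t3
    ("source", "any", "Protein"),           # t4
]
best = closest_constraint(s, candidates)
```

On this sample the search selects t2, the "all" constraint, because it admits the fewest answers that the source constraint would reject.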
EXAMPLE OF AN INTERFACE
“Invention date” implies the attribute is semantically a date data type.
Two elements are used to specify a range query condition, with different roles in specifying the condition.
Such semantic information is hidden from computers.
It is not defined on query interfaces.
This HIDDEN information about each attribute needs to be revealed and defined to enrich the schema matching.
OVERVIEW OF THE SYSTEM
SOURCES ARE NOT PREDEFINED OR PRECONFIGURED, SO THEY MUST BE FOUND
DYNAMICALLY ACCORDING TO THE USER'S AD HOC
INFORMATION NEEDS.
AFTER THE WEB DATABASES ARE DISCOVERED, THEIR
QUERY CAPABILITIES MUST BE EXTRACTED,
ALSO AUTOMATICALLY AND ON THE FLY.
THEN, WHEN QUERYING THE SOURCES, THE QUERY MUST BE TRANSLATED
ON THE FLY, SINCE THE SOURCES ARE UNSEEN.
WORK FLOW OF THE SYSTEM
BACK END SEMANTIC DISCOVERY
- DATA CRAWLER
  - Automatically collects sources from the deep web
- INTERFACE EXTRACTION
  - Extracts query capabilities from interfaces
- SOURCE CLUSTERING
  - Clusters interfaces into subdomains
- SCHEMA MATCHING
  - Discovers semantic matchings
FRONT END EXECUTION OF QUERY
- PROVIDE THE USER A DOMAIN CATEGORY
- FOR EACH CATEGORY A UNIFIED INTERFACE IS GENERATED BY SM
- SELECT APPROPRIATE SOURCES TO RUN THE QUERY (SS)
- QUERIES TO THE SELECTED SOURCES ARE TRANSLATED BY QUERY TRANSLATION
- FINALLY AGGREGATE THE RESULTS BY RESULT COMPILATION
CONCLUSION
Our target is to deploy MetaQuerier as an efficient data integration architecture.
The implementation can be done successfully based on biomedical data.
Within the subsystems of MetaQuerier, some conceptual changes can be made to improve the efficiency of handling huge volumes of unsorted data.
THANK YOU
ANY QUESTIONS?